Academia to Industry

Last week, Brian Keegan had a great post on moving from doctoral studies to industrial data science. If you have not yet read it, go read the whole thing. Here I will share a couple of my favorite parts, as well as one area where I strongly disagree with Brian.

The first key point of the post is to obtain relevant, marketable skills while you are in grad school. There’s just no excuse not to, regardless of your field of study–taking classes and working with scholars in other departments is almost always allowed and frequently encouraged. As Brian puts it:

[I]f you spend 4+ years in graduate school without ever taking classes that demand general programming and/or data analysis skills, I unapologetically believe that your very real illiteracy has held you back from your potential as a scholar and citizen.

Another great nugget in the post is in the context of recruiters, but it is also very descriptive of a prevailing attitude in academia:

This [realizing recruiters’ self-interested motivations] is often hard for academics who have come up through a system that demands deference to others’ agendas under the assumption they have your interests at heart as future advocates.

The final point from the post that I want to discuss may be very attractive and comforting to graduate students doing industry interviews for the first time:

After 4+ years in a PhD program, you’ve earned the privilege to be treated better than the humiliation exercises 20-year old computer science majors are subjected to for software engineering internships.

My response to this is, “no, you haven’t.” This is for exactly the reasons mentioned above–that many graduate students can go through an entire curriculum without being able to code up FizzBuzz. A coding interview is standard for junior and midlevel engineers, even if they have a PhD. Frankly, there are a lot of people trying to pass themselves off as data scientists who can’t code their way out of a paper bag, and a coding interview is a necessary screen. Think of it as a relatively low threshold that greatly enhances the signal-to-noise ratio for the interviewer. If you’re uncomfortable coding in front of another person, spend a few hours pairing with a friend and getting their feedback on your code. Interviewers know that coding on a whiteboard or in a Google Doc is not the most natural environment, and should be able to calibrate for this.
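For readers who haven't met it, FizzBuzz is the canonical screening question: count from 1 to 100, printing "Fizz" for multiples of 3, "Buzz" for multiples of 5, and "FizzBuzz" for multiples of both. A minimal Python sketch:

```python
def fizzbuzz(n):
    """Return the FizzBuzz string for a single integer n."""
    if n % 15 == 0:
        return "FizzBuzz"
    if n % 3 == 0:
        return "Fizz"
    if n % 5 == 0:
        return "Buzz"
    return str(n)

for i in range(1, 101):
    print(fizzbuzz(i))
```

If a candidate with a data-science title can't produce something like this on a whiteboard, that's exactly the signal the screen is designed to catch.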

With this one caveat, I heartily recommend the remainder of the original post. This is an interesting topic, and you can expect to hear more about it here in the future.

Epstein on Athletes

As a follow-up to the most recent series of posts, you may enjoy this TED talk by David Epstein. Epstein is the author of The Sports Gene and offered the claim that kicked off those earlier posts–that he could accurately guess an Olympian’s sport knowing only her height and weight.

The talk offers some additional context for Epstein’s claim. Specifically, Epstein describes how average heights and weights across a set of 24 sports have diverged over time:

In the early half of the 20th century, physical education instructors and coaches had the idea that the average body type was the best for all athletic endeavors: medium height, medium weight, no matter the sport. And this showed in athletes’ bodies. In the 1920s, the average elite high-jumper and average elite shot-putter were the same exact size. But as that idea started to fade away, as sports scientists and coaches realized that rather than the average body type, you want highly specialized bodies that fit into certain athletic niches, a form of artificial selection took place, a self-sorting for bodies that fit certain sports, and athletes’ bodies became more different from one another. Today, rather than the same size as the average elite high jumper, the average elite shot-putter is two and a half inches taller and 130 pounds heavier. And this happened throughout the sports world.

Here’s the chart used to support that point, with data points from the early twentieth century in yellow and more recent data points in blue:

Average height and mass for athletes in 24 sports in the early twentieth century (yellow) and today (blue)


This suggests that it has become easier over time to guess individuals’ sports based on physical characteristics, but as we saw it is still difficult to do with a high degree of accuracy.

Another interesting change highlighted in the talk is the role of technology:

In 1936, Jesse Owens held the world record in the 100 meters. Had Jesse Owens been racing last year in the world championships of the 100 meters, when Jamaican sprinter Usain Bolt finished, Owens would have still had 14 feet to go…. [C]onsider that Usain Bolt started by propelling himself out of blocks down a specially fabricated carpet designed to allow him to travel as fast as humanly possible. Jesse Owens, on the other hand, ran on cinders, the ash from burnt wood, and that soft surface stole far more energy from his legs as he ran. Rather than blocks, Jesse Owens had a gardening trowel that he had to use to dig holes in the cinders to start from. Biomechanical analysis of the speed of Owens’ joints shows that had [he] been running on the same surface as Bolt, he wouldn’t have been 14 feet behind, he would have been within one stride.

The third change Epstein discusses is more dubious: a “changing mindset” among athletes giving them a “can do” attitude. In particular he mentions Roger Bannister’s four-minute mile as a major psychological breakthrough in sporting. As this interview makes clear, Bannister attributes the fact that no progress was made in the fastest mile time between 1945 and 1954 to the destruction, rationing, and overall quite distracting events of WWII. It’s possible that a four-minute mile was run as early as 1770. I wonder what Epstein’s claims would look like on that time scale?

Classifying Olympic Athletes by Sport and Event (Part 3)

This is the last post in a three-part series. Part one, describing the data, is here. Part two gives an overview of the machine learning methods and can be found here. This post presents the results.

To present the results I will use classification matrices, transformed into heatmaps. The rows indicate Olympians’ actual sports, and the columns are their predicted sports. A dark value on the diagonal indicates accurate predictions (the athlete is predicted to be in their actual sport) while light values on the diagonal suggest that Olympians in a certain sport are misclassified by the algorithms used. In each case results for the training set are in the left column and results for the test set are on the right. For a higher resolution version, see this pdf.
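A classification matrix is just a table of counts: rows index the actual labels, columns the predicted labels, and the diagonal holds the correct predictions. A minimal sketch of how one is built (stdlib Python; the labels here are made up for illustration, not drawn from the Olympic data):

```python
from collections import Counter

def confusion_matrix(actual, predicted, labels):
    """Count (actual, predicted) pairs; rows are actual labels, columns predicted."""
    counts = Counter(zip(actual, predicted))
    return [[counts[(a, p)] for p in labels] for a in labels]

actual    = ["Swimming", "Rowing", "Swimming", "Judo"]
predicted = ["Swimming", "Swimming", "Swimming", "Judo"]
matrix = confusion_matrix(actual, predicted, ["Swimming", "Rowing", "Judo"])
# Diagonal entries are correct predictions; off-diagonal entries are errors.
# Here the Rowing athlete was misclassified as a swimmer.
```

Rendering such a matrix as a heatmap simply maps each count to a color intensity.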

Classifying Athletes by Sport

sport-matrices

 

For most rows, swimming is the most common predicted sport. That’s partially because there are so many swimmers in the data and partially because swimmers have a fairly generic body type as measured by height and weight (see the first post). With more features, such as arm length and torso length, we could better distinguish swimmers from non-swimmers.

Three out of the four methods perform similarly. The real oddball here is random forest: it classifies the training data very well, but does about as well on the test data as the other methods. This suggests that random forest is overfitting the data, and won’t give us great predictions on new data.

Classifying Athletes by Event

event-matrices

The results here are similar to the ones above: all four methods do about equally well for the test data, while random forest overfits the training data. The two squares in each figure represent male and female sports. This is a good sanity check–at least our methods aren’t misclassifying men into women’s events or vice versa (recall that sex is one of the four features used for classification).

Accuracy

Visualizations are more helpful than looking at a large table of predicted probabilities, but what are the actual numbers? How accurate are the predictions from these methods? The table below presents accuracy for both tasks, for training and test sets.

accuracy

The various methods classify Olympians into sports and events with about 25-30 percent accuracy. This isn’t great performance. Keep in mind that we only had four features to go on, though–with additional data about the participants we could probably do better.
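Accuracy here is just the fraction of athletes whose predicted sport or event matches their actual one. A sketch with made-up labels:

```python
def accuracy(actual, predicted):
    """Fraction of predictions that match the true label."""
    correct = sum(a == p for a, p in zip(actual, predicted))
    return correct / len(actual)

# With 27 sports, random guessing would land well under 5 percent,
# so 25-30 percent accuracy is far better than chance -- but far from
# classifying "the vast majority" correctly.
print(accuracy(["Swim", "Row", "Judo", "Swim"],
               ["Swim", "Swim", "Judo", "Row"]))  # 0.5
```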

After seeing these results I am deeply skeptical that David Epstein could classify Olympians by event using only their height and weight. Giving him the benefit of the doubt, he probably had in mind the kind of sports and events that we saw were easy to classify: basketball, weightlifting, and high jump, for example. These are the types of competitions that The Sports Gene focuses on. As we have seen, though, there is a wide range of sporting events and a corresponding diversity of body types. Being naturally tall or strong doesn’t hurt, but it also doesn’t automatically qualify you for the Olympics. Training and hard work play an important role, and Olympic athletes exhibit a wide range of physical characteristics.

Classifying Olympic Athletes by Sport and Event (Part 2)

This is the second post in a three-part series. The first post, giving some background and describing the data, is here. In that post I pointed out David Epstein’s claim that he could identify an Olympian’s event knowing only her height and weight. The sheer number of Olympians–about 10,000–makes me skeptical, but I decided to see whether machine learning could make the predictions Epstein claims he could.

To do this, I tried four different machine learning methods. These are all well-documented methods implemented in existing R packages. Code and data are here (for sports) and here (for events).

The first two methods, conditional inference trees (using the party package) and evolutionary trees (using evtree), are both decision tree-based approaches. That means that they sequentially split the data based on binary decisions. If the data falls on one side of the split (say, height above 1.8 meters) you continue down one fork of the tree, and if not you go down the other fork. The difference between these two methods is how the tree is formed: the first recursively partitions the data based on conditional probability, while the second method (as the name suggests) uses an evolutionary algorithm. To get a feel for how this actually divides the data, see the figure below and this post.
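To make the splitting logic concrete, here is a toy hand-written tree in Python. The thresholds and labels are invented for illustration; a fitted tree (whether from ctree or evtree) learns its splits from the data rather than hard-coding them:

```python
def classify(height_m, weight_kg):
    """A toy two-level decision tree: each node is a binary split on one feature."""
    if height_m > 1.90:            # first split: height
        if weight_kg > 100:        # tall and heavy
            return "Shot put"
        return "High jump"         # tall and lean
    if weight_kg > 90:             # shorter but heavy
        return "Weightlifting"
    return "Gymnastics"
```

Each path from the root to a leaf corresponds to one rectangular region of the height-weight plane, which is why tree predictions look like axis-aligned partitions of the scatterplot.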

[Figure: example of a decision tree’s partition of the feature space]

If a single tree is good, a whole forest must be better–or at least that’s the thinking behind random forests, the third method I used. This method generates a large number of trees (500 in this case), each of which has access to only some of the features in the data. Once we have a whole forest of trees, we combine their predictions (usually through a voting process). The combination looks a little bit like the figure below, and a good explanation is here.
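The voting step can be sketched directly: each tree casts a vote, and the forest returns the majority class. The "trees" below are trivial stand-ins; a real random forest also trains each tree on a bootstrap sample of the data with a random subset of features, which is what makes the votes usefully diverse:

```python
from collections import Counter

def forest_predict(trees, x):
    """Majority vote over the predictions of individual trees."""
    votes = Counter(tree(x) for tree in trees)
    return votes.most_common(1)[0][0]

# Three trivial "trees" that disagree about the same athlete.
trees = [
    lambda x: "Swimming",
    lambda x: "Rowing",
    lambda x: "Swimming",
]
print(forest_predict(trees, {"height": 1.88, "weight": 80}))  # Swimming (2 votes to 1)
```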

[Figure: combining many trees’ votes into a single forest prediction]

The fourth and final method used–artificial neural networks–is a bit harder to visualize. Neural networks are sort of a black box, making them difficult to interpret and explain. At a coarse level they are intended to work like neurons in the brain: take some input, and produce output based on whether the input crosses a certain threshold. The neural networks I used have a single hidden layer with 30 hidden nodes (for sport classification) or 50 (for event classification). To get a better feel for how neural networks work, see this three-part series.
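At its core, a one-hidden-layer network is just two weighted sums with a nonlinearity in between. A stripped-down forward pass in Python (the weights here are arbitrary placeholders for illustration; a trained network learns them from data, and inputs would normally be rescaled first):

```python
import math

def forward(x, w_hidden, w_out):
    """One hidden layer: weighted sum -> tanh squashing -> output weighted sum."""
    hidden = [math.tanh(sum(wi * xi for wi, xi in zip(w, x))) for w in w_hidden]
    return sum(wo * h for wo, h in zip(w_out, hidden))

x = [1.88, 80.0]                          # e.g. height (m) and weight (kg)
w_hidden = [[0.5, -0.01], [-0.3, 0.02]]   # 2 hidden nodes (the post used 30 or 50)
w_out = [1.0, -1.0]
score = forward(x, w_hidden, w_out)
```

For classification, a network like this would have one output score per sport and predict the sport with the highest score.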

That’s a very quick overview of the four machine learning methods that I applied to classifying Olympians by sport and event. The data and R code are available at the link above. In the next post, scheduled for Friday, I’ll share the results.

Classifying Olympic Athletes by Sport and Event (Part 1)

Note: This post is the first in a three-part series. It describes the motivation for this project and the data used. When parts two and three are posted I will link to them here.

Can you predict which sport or event an Olympian competes in based solely on her height, weight, age and sex? If so, that would suggest that physical features strongly drive athletes’ relative abilities across sports, and that they pick sports that best leverage their physical predisposition. If not, we might infer that athleticism is a latent trait (like “grit”) that can be applied to the sport of one’s choice.

David Epstein argues that sporting success is largely based on heredity in his book, The Sports Gene. To support his argument, he describes how elite athletes’ physical features have become more specialized to their sport over time (think Michael Phelps). At a basic level Epstein is correct: males and females differ at both a genetic level and in their physical features, generally speaking.

However, Epstein advanced a stronger claim in an interview (at 29:46) with Russ Roberts:

Roberts: [You argue that] if you simply had the height and weight of an Olympic roster, you could do a pretty good job of guessing what their events are. Is that correct?

Epstein: That’s definitely correct. I don’t think you would get every person accurately, but… I think you would get the vast majority of them correctly. And frankly, you could definitely do it easily if you had them charted on a height-and-weight graph, and I think you could do it for most positions in something like football as well.

I chose to assess Epstein’s claim in a project for a machine learning course at Duke this semester. The data was collected by The Guardian, and includes all participants for the 2012 London Summer Olympics. There was complete data on age, sex, height, and weight for 8,856 participants, excluding dressage (an oddity of the data is that every horse-rider pair was treated as the sole participant in a unique event described by the horse’s name). Olympians participate in one or more events (fairly specific competitions, like a 100m race), which are nested in sports (broader categories such as “Swimming” or “Athletics”).

Athletics is by far the largest sport category (around 20 percent of athletes), so when it was included it dominated the predictions. To get more accurate classifications, I excluded Athletics participants from the sport classification task. This left 6,956 participants in 27 sports, split into a training set of size 3,520 and a test set of size 3,436. The 1,900 Athletics participants were classified into 48 different events, and also split into training (907 observations) and test sets (993 observations). For athletes participating in more than one event, only their first event was used.
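The post doesn’t say exactly how the training and test sets were drawn, but a simple random partition like the following is one common approach (stdlib Python, with integer stand-ins for the athlete records):

```python
import random

def train_test_split(rows, train_frac=0.5, seed=42):
    """Shuffle once with a fixed seed, then slice into training and test sets."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    cut = int(len(rows) * train_frac)
    return rows[:cut], rows[cut:]

athletes = list(range(6956))   # stand-ins for the 6,956 non-Athletics participants
train, test = train_test_split(athletes)
# An even 50/50 split yields 3,478 rows each -- close to the post's 3,520 / 3,436.
```

Fixing the seed makes the split reproducible, which matters when comparing several methods on the same data.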

What does an initial look at the data tell us? The features of athletes in some sports (Basketball, Rowing, Weightlifting, and Wrestling) and events (100m hurdles, Hammer throw, High jump, and Javelin) exhibit strong clustering patterns. This makes it relatively easy to guess a participant’s sport or event based on her features. In other sports (Archery, Swimming, Handball, Triathlon) and events (100m race, 400m hurdles, 400m race, and Marathon) there are many overlapping clusters, making classification more difficult.

sport-descriptive

Well-defined (left) and poorly-defined clusters of height and weight by sport.

Well-defined (left) and poorly-defined clusters of height and weight by event.


The next post, scheduled for Wednesday, will describe the machine learning methods I applied to this problem. The results will be presented on Friday.

What Really Happened to Nigeria’s Economy?

You may have heard the news that the size of Nigeria’s economy now stands at nearly $500 billion. Taken at face value (as many commenters have seemed all too happy to do) this means that the West African state “overtook” South Africa’s economy, which was roughly $384 billion in 2012. Nigeria’s reported GDP for that year was $262 billion, meaning it roughly doubled in a year.

How did this “growth” happen? As Bloomberg reported:

On paper, the size of the economy expanded by more than three-quarters to an estimated 80 trillion naira ($488 billion) for 2013, Yemi Kale, head of the National Bureau of Statistics, said at a news conference yesterday to release the data in the capital, Abuja….

The NBS recalculated the value of GDP based on production patterns in 2010, increasing the number of industries it measures to 46 from 33 and giving greater weighting to sectors such as telecommunications and financial services.

The actual change appears to be due almost entirely to Nigeria including figures in GDP calculation that had been excluded previously. There is nothing wrong with this, per se, but it makes comparisons completely unrealistic. This would be like measuring your height in bare feet for years, then doing it while wearing platform shoes. Your reported height would look quite different, without any real growth taking place. Similar complications arise when comparing Nigeria’s new figures to other countries’, when the others have not changed their methodology.

Nigeria’s recalculation adds another layer of complexity to the problems plaguing African development statistics. Lack of transparency (not to mention accuracy) in reporting economic activity makes decisions about foreign aid and favorable loans more difficult. For more information on these problems, see this post discussing Morten Jerven’s book Poor Numbers. If you would like to know more about GDP and other economic summaries, and how they shape our world, I would recommend Macroeconomic Patterns and Stories (somewhat technical), The Leading Indicators, and GDP: A Brief but Affectionate History.

Schneier on Data and Power

Data and Power is the tentative title of a new book, forthcoming from Bruce Schneier. Here’s more from the post describing the topic of the book:

Corporations are collecting vast dossiers on our activities on- and off-line — initially to personalize marketing efforts, but increasingly to control their customer relationships. Governments are using surveillance, censorship, and propaganda — both to protect us from harm and to protect their own power. Distributed groups — socially motivated hackers, political dissidents, criminals, communities of interest — are using the Internet to both organize and effect change. And we as individuals are becoming both more powerful and less powerful. We can’t evade surveillance, but we can post videos of police atrocities online, bypassing censors and informing the world. How long we’ll still have those capabilities is unclear….

There’s a fundamental trade-off we need to make as society. Our data is enormously valuable in aggregate, yet it’s incredibly personal. The powerful will continue to demand aggregate data, yet we have to protect its intimate details. Balancing those two conflicting values is difficult, whether it’s medical data, location data, Internet search data, or telephone metadata. But balancing them is what society needs to do, and is almost certainly the fundamental issue of the Information Age.

There’s more at the link, including several other potential titles. The topic will likely interest many readers of this blog, and the book will presumably build on Schneier’s ideas of inequality and online feudalism, discussed here.

“The Impact of Leadership Removal on Mexican Drug Trafficking Organizations”

That’s the title of a new article, now online at the Journal of Quantitative Criminology. Thanks to fellow grad students Cassy Dorff and Shahryar Minhas for their feedback. Thanks also to mentors at the University of Houston (Jim Granato, Ryan Kennedy) and Duke University (Michael D. Ward, Scott de Marchi, Guillermo Trejo) for thoughtful comments. The anonymous reviewers at JQC and elsewhere were also a big help.

Here is the abstract:

Objectives

Has the Mexican government’s policy of removing drug-trafficking organization (DTO) leaders reduced or increased violence? In the first 4 years of the Calderón administration, over 34,000 drug-related murders were committed. In response, the Mexican government captured or killed 25 DTO leaders. This study analyzes changes in violence (drug-related murders) that followed those leadership removals.

Methods

The analysis consists of cross-sectional time-series negative binomial modeling of 49 months of murder counts in 32 Mexican states (including the federal district).

Results

Leadership removals are generally followed by increases in drug-related murders. A DTO’s home state experiences more subsequent violence than the state where the leader was removed. Killing leaders is associated with more violence than capturing them. However, removing leaders for whom a 30 million peso bounty was offered is associated with a smaller increase than other removals.

Conclusion

DTO leadership removals in Mexico were associated with an estimated 415 additional deaths during the first 4 years of the Calderón administration. Reforming Mexican law enforcement and improving career prospects for young men are more promising counter-narcotics strategies. Further research is needed to analyze how the rank of leaders mediates the effect of their removal.

I didn’t shell out $3,000 for open access, so the article is behind a paywall. If you’d like a draft of the manuscript just email me.

Mexico Update Following Joaquin Guzmán’s Capture

As you probably know by now, the Sinaloa cartel’s leader Joaquin Guzmán was captured in Mexico last Saturday. How will violence in Mexico shift following Guzmán’s removal?

(Alfredo Estrella/AFP/Getty Images)


I take up this question in an article forthcoming in the Journal of Quantitative Criminology. According to that research (which used negative binomial modeling on a cross-sectional time series of Mexican states from 2006 to 2010), DTO leadership removals in Mexico are generally followed by increased violence. However, capturing leaders is associated with less violence than killing them. The removal of leaders for whom a 30 million peso bounty (the highest in my dataset, which generally identified high-level leaders) had been offered is also associated with less violence. The reward for Guzmán’s capture was higher than any other contemporary DTO leader: 87 million pesos. Given that Guzmán was a top-level leader and was arrested rather than killed, I would not expect a significant uptick in violence (in the next 6 months) due to his removal. This follows President Peña Nieto’s goal of reducing DTO violence.

My paper was in progress for a while, so the data is a few years old. Fortunately Brian Phillips has also taken up this question using additional data and similar methods, and his results largely corroborate mine:

Many governments kill or capture leaders of violent groups, but research on consequences of this strategy shows mixed results. Additionally, most studies have focused on political groups such as terrorists, ignoring criminal organizations – even though they can represent serious threats to security. This paper presents an argument for how criminal groups differ from political groups, and uses the framework to explain how decapitation should affect criminal groups in particular. Decapitation should weaken organizations, producing a short-term decrease in violence in the target’s territory. However, as groups fragment and newer groups emerge to address market demands, violence is likely to increase in the longer term. Hypotheses are tested with original data on Mexican drug-trafficking organizations (DTOs), 2006-2012, and results generally support the argument. The kingpin strategy is associated with a reduction of violence in the short term, but an increase in violence in the longer term. The reduction in violence is only associated with leaders arrested, not those killed.

A draft of the full paper is here.

Who says North is “up”?

There are several childhood lessons that I trace back to dinners at Outback Steakhouse: the deliciousness of cheese fries, the inconvenience of being in the middle of a wraparound booth, and the historical contingency of North as “up” on maps.
Upside_Down_World_Map

Who started using the NESW arrangement that is virtually omnipresent on maps today? Was it due to the fact that civilization as we now know it developed in the Northern hemisphere? (Incidentally, that’s why clocks run clockwise: shadows on a Northern-hemisphere sundial sweep in that direction, while a sundial in the Southern hemisphere goes the other way around.)

That doesn’t appear to be the case according to Nick Danforth, who recently took on this question at al-Jazeera America (via Flowing Data):

There is nothing inevitable or intrinsically correct — not in geographic, cartographic or even philosophical terms — about the north being represented as up, because up on a map is a human construction, not a natural one. Some of the very earliest Egyptian maps show the south as up, presumably equating the Nile’s northward flow with the force of gravity. And there was a long stretch in the medieval era when most European maps were drawn with the east on the top. If there was any doubt about this move’s religious significance, they eliminated it with their maps’ pious illustrations, whether of Adam and Eve or Christ enthroned. In the same period, Arab map makers often drew maps with the south facing up, possibly because this was how the Chinese did it.

So who started putting North up top? According to Danforth, that was Ptolemy:

[He] was a Hellenic cartographer from Egypt whose work in the second century A.D. laid out a systematic approach to mapping the world, complete with intersecting lines of longitude and latitude on a half-eaten-doughnut-shaped projection that reflected the curvature of the earth. The cartographers who made the first big, beautiful maps of the entire world, Old and New — men like Gerardus Mercator, Henricus Martellus Germanus and Martin Waldseemuller — were obsessed with Ptolemy. They turned out copies of Ptolemy’s Geography on the newly invented printing press, put his portrait in the corners of their maps and used his writings to fill in places they had never been, even as their own discoveries were revealing the limitations of his work.

map_projections

Ptolemy probably had his reasons, but they are lost to history. As Danforth concludes, “The orientation of our maps, like so many other features of the modern world, arose from the interplay of chance, technology and politics in a way that defies our desire to impose easy or satisfying narratives.” Yet another example of a micro-institution that rules our world.