Academia to Industry

Last week, Brian Keegan published a great post on moving from doctoral studies to industrial data science. If you have not read it yet, go read the whole thing. Here I will share a couple of my favorite parts, as well as one area where I strongly disagree with Brian.

The first key point of the post is to obtain relevant, marketable skills while you are in grad school. There’s just no excuse not to, regardless of your field of study–taking classes and working with scholars in other departments is almost always allowed and frequently encouraged. As Brian puts it:

[I]f you spend 4+ years in graduate school without ever taking classes that demand general programming and/or data analysis skills, I unapologetically believe that your very real illiteracy has held you back from your potential as a scholar and citizen.

Another great nugget in the post is in the context of recruiters, but it is also very descriptive of a prevailing attitude in academia:

This [realizing recruiters’ self-interested motivations] is often hard for academics who have come up through a system that demands deference to others’ agendas under the assumption they have your interests at heart as future advocates.

The final point from the post that I want to discuss may be very attractive and comforting to graduate students doing industry interviews for the first time:

After 4+ years in a PhD program, you’ve earned the privilege to be treated better than the humiliation exercises 20-year old computer science majors are subjected to for software engineering internships.

My response to this is, “no, you haven’t.” This is for exactly the reasons mentioned above–that many graduate students can go through an entire curriculum without being able to code up FizzBuzz. A coding interview is standard for junior and midlevel engineers, even if they have a PhD. Frankly, there are a lot of people trying to pass themselves off as data scientists who can’t code their way out of a paper bag, and a coding interview is a necessary screen. Think of it as a relatively low threshold that greatly enhances the signal-to-noise ratio for the interviewer. If you’re uncomfortable coding in front of another person, spend a few hours pairing with a friend and getting their feedback on your code. Interviewers know that coding on a whiteboard or in a Google Doc is not the most natural environment, and should be able to calibrate for this.
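
For readers who have not seen it, FizzBuzz is the canonical screening question: list the numbers 1 through n, substituting "Fizz" for multiples of 3, "Buzz" for multiples of 5, and "FizzBuzz" for multiples of both. A minimal Python version might look like this:

```python
def fizzbuzz(n):
    """Return the FizzBuzz sequence for 1..n as a list of strings."""
    out = []
    for i in range(1, n + 1):
        if i % 15 == 0:        # multiple of both 3 and 5
            out.append("FizzBuzz")
        elif i % 3 == 0:
            out.append("Fizz")
        elif i % 5 == 0:
            out.append("Buzz")
        else:
            out.append(str(i))
    return out

print(fizzbuzz(15))
```

If writing something like this in front of another person makes you nervous, that is exactly the discomfort a few hours of pairing practice will fix.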

With this one caveat, I heartily recommend the remainder of the original post. This is an interesting topic, and you can expect to hear more about it here in the future.

What Really Happened to Nigeria’s Economy?

You may have heard the news that the size of Nigeria’s economy now stands at nearly $500 billion. Taken at face value (as many commenters have seemed all too happy to do), this means that the West African state “overtook” South Africa’s economy, which was roughly $384 billion in 2012. Nigeria’s reported GDP for that year was $262 billion, meaning it roughly doubled in a year.

How did this “growth” happen? As Bloomberg reported:

On paper, the size of the economy expanded by more than three-quarters to an estimated 80 trillion naira ($488 billion) for 2013, Yemi Kale, head of the National Bureau of Statistics, said at a news conference yesterday to release the data in the capital, Abuja….

The NBS recalculated the value of GDP based on production patterns in 2010, increasing the number of industries it measures to 46 from 33 and giving greater weighting to sectors such as telecommunications and financial services.

The actual change appears to be due almost entirely to Nigeria including figures in GDP calculation that had been excluded previously. There is nothing wrong with this, per se, but it makes comparisons completely unrealistic. This would be like measuring your height in bare feet for years, then doing it while wearing platform shoes. Your reported height would look quite different, without any real growth taking place. Similar complications arise when comparing Nigeria’s new figures to those of other countries that have not changed their methodology.

Nigeria’s recalculation adds another layer of complexity to the problems plaguing African development statistics. Lack of transparency (not to mention accuracy) in reporting economic activity makes decisions about foreign aid and favorable loans more difficult. For more information on these problems, see this post discussing Morten Jerven’s book Poor Numbers. If you would like to know more about GDP and other economic summaries, and how they shape our world, I would recommend Macroeconomic Patterns and Stories (somewhat technical), The Leading Indicators, and GDP: A Brief but Affectionate History.

Mexico Update Following Joaquin Guzmán’s Capture

As you probably know by now, the Sinaloa cartel’s leader Joaquin Guzmán was captured in Mexico last Saturday. How will violence in Mexico shift following Guzmán’s removal?

(Alfredo Estrella/AFP/Getty Images)

I take up this question in an article forthcoming in the Journal of Quantitative Criminology. According to that research (which used negative binomial modeling on a cross-sectional time series of Mexican states from 2006 to 2010), DTO leadership removals in Mexico are generally followed by increased violence. However, capturing leaders is associated with less violence than killing them. The removal of leaders for whom a 30 million peso bounty (the highest in my dataset, which generally identified high-level leaders) had been offered is also associated with less violence. The reward for Guzmán’s capture was higher than that for any other contemporary DTO leader: 87 million pesos. Given that Guzmán was a top-level leader and was arrested rather than killed, I would not expect a significant uptick in violence (in the next six months) due to his removal. This is consistent with President Peña Nieto’s goal of reducing DTO violence.
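
Why negative binomial rather than Poisson? Violence counts are overdispersed: the variance far exceeds the mean, which violates the Poisson assumption. A quick numpy sketch illustrates the difference (the parameter values here are arbitrary, chosen only for illustration, and this is not the model from the paper):

```python
import numpy as np

rng = np.random.default_rng(42)

# Poisson: variance equals the mean by construction
pois = rng.poisson(lam=8, size=100_000)

# Negative binomial with the same mean, n(1-p)/p = 8, but extra dispersion
negbin = rng.negative_binomial(n=2, p=0.2, size=100_000)

print(pois.mean(), pois.var())      # both close to 8
print(negbin.mean(), negbin.var())  # mean ~8, variance ~40
```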

My paper was in progress for a while, so the data is a few years old. Fortunately Brian Phillips has also taken up this question using additional data and similar methods, and his results largely corroborate mine:

Many governments kill or capture leaders of violent groups, but research on consequences of this strategy shows mixed results. Additionally, most studies have focused on political groups such as terrorists, ignoring criminal organizations – even though they can represent serious threats to security. This paper presents an argument for how criminal groups differ from political groups, and uses the framework to explain how decapitation should affect criminal groups in particular. Decapitation should weaken organizations, producing a short-term decrease in violence in the target’s territory. However, as groups fragment and newer groups emerge to address market demands, violence is likely to increase in the longer term. Hypotheses are tested with original data on Mexican drug-trafficking organizations (DTOs), 2006-2012, and results generally support the argument. The kingpin strategy is associated with a reduction of violence in the short term, but an increase in violence in the longer term. The reduction in violence is only associated with leaders arrested, not those killed.

A draft of the full paper is here.

Visualizing the Indian Buffet Process with Shiny

(This is a somewhat more technical post than usual. If you just want the gist, skip to the visualization.)

N customers enter an Indian buffet restaurant, one after another. It has a seemingly endless array of dishes. The first customer fills her plate with a Poisson(α) number of dishes. Each successive customer i tastes the previously sampled dishes in proportion to their popularity (sampling the kth dish with probability m_k/i, where m_k is the number of previous customers who have tried it). The ith customer then samples a Poisson(α/i) number of new dishes.

That’s the basic idea behind the Indian Buffet Process (IBP). On Monday Eli Bingham and I gave a presentation on the IBP in our machine learning seminar at Duke, taught by Katherine Heller. The IBP is used in Bayesian non-parametrics to put a prior on (exchangeability classes of) binary matrices. The matrices usually represent the presence of features (“dishes” above, or the columns of the matrix) in objects (“customers,” or the rows of the matrix). The culinary metaphor is used by analogy to the Chinese Restaurant Process.
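
The generative process is easy to simulate directly. Here is a minimal numpy sketch (an illustration of the process only, not the code behind the Shiny app), which returns the unsorted binary matrix Z:

```python
import numpy as np

def sample_ibp(n_customers, alpha, seed=None):
    """Draw one binary feature matrix Z from an IBP(alpha) prior."""
    rng = np.random.default_rng(seed)
    rows = []            # each customer's row, growing in length
    m = np.zeros(0)      # m_k: how many customers have tried dish k
    for i in range(1, n_customers + 1):
        # taste each existing dish k with probability m_k / i
        old = (rng.random(m.size) < m / i).astype(int)
        # then sample a Poisson(alpha / i) number of brand-new dishes
        n_new = rng.poisson(alpha / i)
        rows.append(np.concatenate([old, np.ones(n_new, dtype=int)]))
        m = np.concatenate([m + old, np.ones(n_new)])
    # pad rows with zeros so they form a rectangular matrix
    Z = np.zeros((n_customers, m.size), dtype=int)
    for i, row in enumerate(rows):
        Z[i, :row.size] = row
    return Z

Z = sample_ibp(10, alpha=10, seed=1)
print(Z.shape)  # 10 rows; the column count is random
```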

Although the visualizations in the main paper summarizing the IBP are good, I thought it would be helpful to have an interactive visualization where you could change α and N to see what a random matrix with those parameters looks like. For this I used Shiny, although it would also be fun to do in d3.

One realization of the IBP, with α=10.

In the example above, the first customer (top row) sampled seven dishes. The second customer sampled four of those seven dishes, and then four more dishes that the first customer did not try. The process continues for all 10 customers. (Note that this matrix is not sorted into its left-ordered form. It also sometimes gives an error if α << N, but I wanted users to be able to choose arbitrary values of N so I have not changed this yet.) You can play with the visualization yourself here.

Interactive online visualizations like this can be a helpful teaching tool, and the process of making them can also improve your own understanding of the process. If you would like to make another visualization of the IBP (or another machine learning tool that lends itself to graphical representation) I would be happy to share it here. I plan to add the Chinese restaurant process and a Dirichlet process mixture of Gaussians soon. You can find more about creating Shiny apps here.

What Can We Learn from Games?

This holiday season I enjoyed giving, receiving, and playing several new card and board games with friends and family. These included classics such as cribbage, strategy games like Dominion and Power Grid, and the whimsical Munchkin.

Can video and board games teach us more than just strategy? What if games could teach us not to be better thinkers, but just to be… better? A while ago we discussed how Monopoly was originally designed as a learning experience to promote cooperation. Lately I have learned of two other such games in this growing genre and wanted to share them here.

The first is Depression Quest by Zoe Quinn (via Jeff Atwood):

Depression Quest is an interactive fiction game where you play as someone living with depression. You are given a series of everyday life events and have to attempt to manage your illness, relationships, job, and possible treatment. This game aims to show other sufferers of depression that they are not alone in their feelings, and to illustrate to people who may not understand the illness the depths of what it can do to people.

The second is Train by Brenda Romero (via Marcus Montano) described here with spoilers:

In the game, the players read typewritten instructions. The game board is a set of train tracks with box cars, sitting on top of a window pane with broken glass. There are little yellow pegs that represent people, and the player’s job is to efficiently load those people onto the trains. A typewriter sits on one side of the board.

The game takes anywhere from a minute to two hours to play, depending on when the players make a very important discovery. At some point, they turn over a card that has a destination for the train. It says Auschwitz. At that point, for anyone who knows their history, it dawns on the player that they have been loading Jews onto box cars so they can be shipped to a World War II concentration camp and be killed in the gas showers or burned in the ovens.

The key emotion that Romero said she wanted the player to feel was “complicity.”

“People blindly follow rules,” she said. “Will they blindly follow rules that come out of a Nazi typewriter?”

I have tried creating my own board games in the past, and this gives me renewed interest and a higher standard. What is the most thought-provoking moment you have experienced playing games?

Two Great Talks on Government and Technology

If you are getting ready to travel next week, you might want to have a couple of good talks/podcasts handy for the trip. Here are two that I enjoyed, on the topic of government and technology.

The first is about how technology can help governments. Ben Orenstein of “Giant Robots Smashing Into Other Giant Robots” discusses Code for America with Catherine Bracy. Catherine recounts some ups and downs of CfA’s partnerships with cities throughout America and internationally. CfA fellows commit a year to help local governments with challenges amenable to technology. One great example that the podcast discusses is a tool for parents in Boston to see which schools they could send their kids to when the city switched from location-based school assignment to allowing students to attend schools throughout the city. (Incidentally, the school matching algorithm that Boston used was designed by some professors in economics at Duke, who drew on work for which Roth and Shapley won the Nobel Prize.)
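
The mechanism behind that Nobel-winning work is deferred acceptance (Gale-Shapley). Here is a toy student-proposing version with made-up students and schools; it is an illustration of the algorithm, not Boston's actual implementation:

```python
def deferred_acceptance(student_prefs, school_prefs, capacity):
    """Student-proposing deferred acceptance; returns school -> students."""
    # precompute each school's ranking of students for fast comparisons
    rank = {s: {st: r for r, st in enumerate(prefs)}
            for s, prefs in school_prefs.items()}
    next_pick = {st: 0 for st in student_prefs}  # index into each pref list
    held = {s: [] for s in school_prefs}         # tentative acceptances
    free = list(student_prefs)
    while free:
        st = free.pop()
        if next_pick[st] >= len(student_prefs[st]):
            continue                              # exhausted list: unmatched
        s = student_prefs[st][next_pick[st]]
        next_pick[st] += 1
        held[s].append(st)
        held[s].sort(key=lambda x: rank[s][x])
        if len(held[s]) > capacity[s]:
            free.append(held[s].pop())            # reject lowest-ranked
    return {s: sorted(sts) for s, sts in held.items()}

students = {"Ana": ["X", "Y"], "Ben": ["X", "Y"], "Cal": ["Y", "X"]}
schools = {"X": ["Ben", "Ana", "Cal"], "Y": ["Ana", "Ben", "Cal"]}
print(deferred_acceptance(students, schools, {"X": 1, "Y": 2}))
# → {'X': ['Ben'], 'Y': ['Ana', 'Cal']}
```

A nice property of the student-proposing version is that the outcome is the student-optimal stable matching, regardless of the order in which proposals are processed.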

The second talk offers another point of view on techno-politics: when government abuses technology. Steve Klabnik’s “No Secrets Allowed” talk from Golden Gate Ruby Conference discusses recent revelations regarding the NSA and privacy. In particular he explains why “I have nothing to hide” is not an appropriate response. The talk is not entirely hopeless, and includes recommendations such as using Tor. The Ruby Rogues also had a roundtable discussing Klabnik’s presentation, which you can find here.

Other recommendations are welcome.

The Economics of Movie Popcorn

The Smithsonian’s Food & Think blog recounts a long history of movie theaters’ objections to popcorn. They wanted to be as classy as live theaters. Nickelodeons didn’t have the ventilation required for popcorn machines. Moreover, crunchy snacks would have been unwelcome during silent films.

But moviegoers still wanted their popcorn, and street vendors met their demand. This led to signs asking patrons to check their coats and their corn at the theater entrance.

Eventually, movie theater owners realized that if they cut out the middleman, their profits would skyrocket. For many theaters, the transition to selling snacks helped save them from the crippling Depression. In the mid-1930s, the movie theater business started to go under. “But those that began serving popcorn and other snacks,” Smith explains, “survived.” Take, for example, a Dallas movie theater chain that installed popcorn machines in 80 theaters, but refused to install machines in their five best theaters, which they considered too high class to sell popcorn. In two years, the theaters with popcorn saw their profits soar; the five theaters without popcorn watched their profits go into the red.

Much more here, including how movie theater demand changed the types of popcorn that are grown.

Popcorn and other concessions are important to theaters because a large percentage of ticket sales (especially during the first couple of weeks after a movie premieres) go to the studio. Recent figures I’ve seen are that concession sales are 80-90 percent profit, whereas in the opening weekend only about 20 percent of the sale price goes to the theater. This means that concessions can make up nearly half the profit for a theater–no wonder they try to keep viewers from bringing in their own refreshments.
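
A back-of-the-envelope calculation shows how that arithmetic can work out. The margin and revenue-split figures below are the ones quoted above; the prices and the share of patrons who buy concessions are hypothetical:

```python
ticket_price = 10.00
theater_ticket_share = 0.20   # opening weekend: ~20% of the ticket stays with the theater
popcorn_price = 8.00
concession_margin = 0.85      # midpoint of the 80-90% figure
buy_rate = 0.30               # hypothetical: fraction of patrons buying concessions

# expected profit per patron from each source
ticket_profit = ticket_price * theater_ticket_share               # $2.00
concession_profit = popcorn_price * concession_margin * buy_rate  # $2.04

share = concession_profit / (ticket_profit + concession_profit)
print(f"{share:.0%}")  # concessions contribute roughly half of profit
```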

Small bags of popcorn have now turned into buckets, perhaps in an effort to justify charging $8-10 rather than the nickel such snacks sold for when “talkies” were new. This transition is covered in the book Why Popcorn Costs So Much at the Movies and an interview with the author is here.

Visualizing the BART Labor Dispute

Labor disputes are complicated, and the BART situation is no different. Negotiations resumed this week after the cooling-off period called for by the governor of California in response to the July strikes.

To help get up to speed, check out the data visualizations made by the Bay Area d3 User Group in conjunction with the UC Berkeley VUDLab. They have a roundup of news articles, open data, and open source code, as well as links to all the authors’ Twitter profiles.

The infographics address several key questions relevant to the debate, including how much BART employees earn, who rides BART and where, and the cost of living for BART employees.

More here.

Review: RubyMotion iOS Development Essentials

RubyMotion is a continued topic of interest on this blog, and I will likely have more posts about it in the near future. At this stage I am still getting comfortable with iOS development, but I would much rather be doing it in the friendly playground of Ruby than in the Objective-C world. In addition to the RubyMotion book from PragProg, the next resource I would recommend is RubyMotion iOS Development Essentials.*

The book takes a “zero-to-deploy” approach, which is great for beginners trying to get their first app into the App Store. The first few chapters will be redundant for developers who have worked with RubyMotion before, but they provide a helpful introduction to how RM works and the Model-View-Controller paradigm.

For several chapters the book uses the same example application, a restaurant recommender reminiscent of Yelp. Building up from a simple application is a nice way of presenting the material. By the time readers have worked through these chapters they will have an example app that is more interesting than many of the toy apps in shorter tutorials.

Later chapters will benefit novice and experienced developers alike, because they fill a gap in the RubyMotion literature. Many tutorials overlook the process of testing RM code, and testing iOS in general can be challenging. The testing chapter of this book goes over unit testing, functional testing, and tests that rely on device events such as gestures.

My favorite chapter in the book was chapter 6, which goes over device capabilities. At 46 pages this is the longest chapter in the book, covering GPS, gestures, Core Data, using the Address Book, and more. I especially enjoyed working through the section on accessing the camera and Photo Library. This is difficult to test on the simulator, which (obviously) has no built-in camera (as is also true of some iOS devices, including certain iPod Touch models), but the example app covers how to handle this gracefully.

Stylistically, it can be a challenge to lay out a book that uses iOS API jargon like UIImagePickerControllerSourceTypePhotoLibrary. Some readers may gripe about the authors’ choice of two-space indenting, but that is my preference, so it did not bother me. One addition I would have liked is extra formatting for the code, using either colors (for the e-book) or bolding (for the print version) to distinguish function names and help the reader keep their place. The apps themselves rely mainly on iOS defaults; this is common in tutorials, and it also helps them look natural in iOS 7. Most of the time I was working through the book I used the iOS 6.1 simulator, but it was no problem to upgrade to iOS 7.

As a whole this book is a thorough introduction to RubyMotion development. It has several key features that are missing from other RubyMotion tutorials, including an emphasis on testing code. This book makes a great resource for new RubyMotion developers, or developers who want to use more of the device capabilities.

*Note: For this review I relied on the e-book version, which was provided as a review copy by Packt Publishing.

African Statistics and the Problem of Measurement

We have briefly mentioned Morten Jerven’s work Poor Numbers before, but it deserves a bit more attention. The book discusses the woeful state of GDP figures in Africa and the issues that arise in making cross-national comparisons between countries whose statistical offices operate very differently (interview here).

Discussing Jerven’s work now is especially timely given current events. Jerven was scheduled to speak at UNECA on statistical capacity in Africa. However, Pali Lehohla of South Africa strongly objected to Jerven’s ideas and led the opposition that ultimately prevented him from speaking. Had he been allowed to present, Jerven would have summarized the issues thus:

I would argue that ambitions should be tempered in international development statistics. The international standardization of measurement of economic development has led to a procedural bias. There has been a tendency to aim for high adherence to procedures instead of focusing on the content of the measures. Development measures should be taken as a starting point in local data availability, and statisticians should refrain from reporting aggregate measures that appear to be based on data but in fact are very feeble projections or guesses. This means that it is necessary to shift the focus away from formulas, standards, handbooks, and software. What matters are what numbers are available and how good those numbers are. Comparability across time and space needs to start with the basic input of knowledge, not with the system in which this information is organized. (Jerven, 2013, p.107).

African Arguments gave Jerven a chance to respond to his opposition:

The initial response from many economists working on Africa varied between, ‘so what?…we already know this’, ‘we don’t trust or use official statistics on Africa anyhow’ and ‘I know but what is the alternative?’ Many more scholars in African studies and development studies, who were generally concerned with the long-standing use of numbers on Africa as ‘facts’, were relieved that there was finally someone who sought, not only to unveil the real state of affairs, but genuinely wanted to answer some of the problems that users face when trying to use the data to test their scholarly questions….

We need to rethink the demand for data and how we invest in data in Africa and beyond. My focus has been on Africa because the problem is particularly striking there. To fix the gaps we should first re-think the MDG and other donor agendas for data and do a cost benefit analysis – what are the costs of providing these data and what is the opportunity cost of providing these data? The opportunity cost is often ignored. Local demand for data needs to come into focus. A statistical office is only sustainable if it serves local needs for information. Statistics is a public good, and we need a good open debate on how to supply them.

This is a major issue, and all social scientists–not just economists–should be aware of Jerven’s work. As James C. Scott has pointed out, measurement is a political act.