Tirole on Open Source

Jean Tirole is the latest recipient of the Nobel prize in economics, as was announced Monday. For more background on his work, see NPR and the New Yorker. My favorite portion of Tirole’s work (and, admittedly, pretty much the only part I’ve read) is his work on open source software communities. Much of this is joint work with Josh Lerner. Below I share a few selections from his work that indicate the general theme.

There are two main economic puzzles to open source software. First, why would highly skilled workers who earn a substantial hourly wage contribute their time to developing a product they won’t directly sell (and how do they convince their employers, in some cases, to support this)? Second, given the scale of these projects, how do they self-govern to set priorities and direct effort?

The answer to the first question is a combination of personal reputation and the ability to develop complementary software (Lerner and Tirole, 2002, pp. 215-217). Most software work is “closed source,” meaning others can see the finished product but not the underlying code. For software developers, having your code out in the open gives others (especially potential collaborators or employers) the chance to assess your abilities. This is important for ensuring career mobility. Open source software is also a complement to personal or professional projects. When there are components that are common across many projects, such as an operating system (Linux) or web framework (Rails), it makes sense for many programmers to contribute their effort to build a better mousetrap. This shared component can then improve everyone’s future projects by saving them time or effort. The collaboration of many developers also helps to identify bugs that may not have been caught by any single individual. Some of Tirole’s earlier work on collective reputations is closely related, as there appears to be an “alumni effect” for developers who participated in successful projects.

Tirole and Lerner’s answer to the second question revolves around leadership. Leaders are often the founders of or early participants in an open source project. Their skills and early membership status instill trust. As the authors put it, other programmers “must believe that the leader’s objectives are sufficiently congruent with theirs and not polluted by ego-driven, commercial, or political biases. In the end, the leader’s recommendations are only meant to convey her information to the community of participants.” (Lerner and Tirole, 2002, p. 222) This relates to some of Tirole’s other work, with Roland Benabou, on informal laws and social norms.

Again, this is only a small portion of Tirole’s work, but I find it fascinating. There’s more on open source governance in the archives. This post on reputation in hacker culture or this one on the Ruby community are good places to start.

Classifying Olympic Athletes by Sport and Event (Part 3)

This is the last post in a three-part series. Part one, describing the data, is here. Part two gives an overview of the machine learning methods and can be found here. This post presents the results.

To present the results I will use classification matrices, transformed into heatmaps. The rows indicate Olympians’ actual sports, and the columns are their predicted sports. A dark value on the diagonal indicates accurate predictions (the athlete is predicted to be in their actual sport) while light values on the diagonal suggest that Olympians in a certain sport are misclassified by the algorithms used. In each case results for the training set are in the left column and results for the test set are on the right. For a higher resolution version, see this pdf.
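For readers who want to build this kind of figure themselves, here is a minimal sketch of turning a confusion matrix into a heatmap in R. It is not the exact code behind the figures below; the vectors `actual` and `predicted` are hypothetical label vectors.

```r
# Minimal sketch: confusion matrix as a heatmap (not the exact code
# behind the figures). `actual` and `predicted` are hypothetical
# factor vectors of sport labels sharing the same levels.
library(ggplot2)

confusion <- as.data.frame(table(Actual = actual, Predicted = predicted))

ggplot(confusion, aes(x = Predicted, y = Actual, fill = Freq)) +
  geom_tile() +
  scale_fill_gradient(low = "white", high = "steelblue") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))
```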

Classifying Athletes by Sport

[Figure: classification-matrix heatmaps by sport, training (left) and test (right) for each method]

For most rows, swimming is the most common predicted sport. That’s partially because there are so many swimmers in the data and partially because swimmers have a fairly generic body type as measured by height and weight (see the first post). With more features, such as arm length and torso length, we could better distinguish between swimmers and non-swimmers.

Three out of the four methods perform similarly. The real oddball here is random forest: it classifies the training data very well, but does about as well on the test data as the other methods. This suggests that random forest is overfitting the data, and won’t give us great predictions on new data.

Classifying Athletes by Event

[Figure: classification-matrix heatmaps by event, training (left) and test (right) for each method]

The results here are similar to the ones above: all four methods do about equally well for the test data, while random forest overfits the training data. The two squares in each figure represent male and female sports. This is a good sanity check–at least our methods aren’t misclassifying men into women’s events or vice versa (recall that sex is one of the four features used for classification).

Accuracy

Visualizations are more helpful than looking at a large table of predicted probabilities, but what are the actual numbers? How accurate are the predictions from these methods? The table below presents accuracy for both tasks, for training and test sets.

[Table: classification accuracy by method, for the sport and event tasks, on training and test sets]

The various methods classify Olympians into sports and events with about 25-30 percent accuracy. This isn’t great performance. Keep in mind that we only had four features to go on, though–with additional data about the participants we could probably do better.

After seeing these results I am deeply skeptical that David Epstein could classify Olympians by event using only their height and weight. Giving him the benefit of the doubt, he probably had in mind the kinds of sports and events that we saw were easy to classify: basketball, weightlifting, and high jump, for example. These are the types of competitions that The Sports Gene focuses on. As we have seen, though, there is a wide range of sporting events and a corresponding diversity of body types. Being naturally tall or strong doesn’t hurt, but it also doesn’t automatically qualify you for the Olympics. Training and hard work play an important role, and Olympic athletes exhibit a wide range of physical characteristics.

Classifying Olympic Athletes by Sport and Event (Part 2)

This is the second post in a three-part series. The first post, giving some background and describing the data, is here. In that post I pointed out David Epstein’s claim that he could identify an Olympian’s event knowing only her height and weight. The sheer number of Olympians–about 10,000–makes me skeptical, but I decided to see whether machine learning could accurately produce the predictions Mr. Epstein claims he could.

To do this, I tried four different machine learning methods. These are all well-documented methods implemented in existing R packages. Code and data are here (for sports) and here (for events).

The first two methods, conditional inference trees (using the party package) and evolutionary trees (using evtree), are both decision tree-based approaches. That means that they sequentially split the data based on binary decisions. If the data falls on one side of the split (say, height above 1.8 meters) you continue down one fork of the tree, and if not you go down the other fork. The difference between these two methods is how the tree is formed: the first recursively partitions the data using conditional inference tests, while the second (as the name suggests) searches for a tree with an evolutionary algorithm. To get a feel for how this actually divides the data, see the figure below and this post.

[Figure: example of how a decision tree partitions the height-weight space]
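For concreteness, here is roughly how the two tree methods can be fit; the data frame and column names are assumptions, not the project’s actual code.

```r
# Hedged sketch of the two tree-based fits. `train` and `test` are
# assumed data frames with columns Sport, Height, Weight, Age, Sex.
library(party)   # conditional inference trees
library(evtree)  # evolutionary trees

ct <- ctree(Sport ~ Height + Weight + Age + Sex, data = train)
et <- evtree(Sport ~ Height + Weight + Age + Sex, data = train)

# Predicted sports for the held-out athletes
ct_pred <- predict(ct, newdata = test)
et_pred <- predict(et, newdata = test)
```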

If a single tree is good, a whole forest must be better–or at least that’s the thinking behind random forests, the third method I used. This method generates a large number of trees (500 in this case), each of which has access to only some of the features in the data. Once we have a whole forest of trees, we combine their predictions (usually through a voting process). The combination looks a little bit like the figure below, and a good explanation is here.

[Figure: many trees’ predictions combined by voting in a random forest]
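The randomForest package is one common implementation; a sketch along these lines (again with assumed variable names) grows 500 trees and pools their votes.

```r
# Sketch of a 500-tree random forest (randomForest is one common
# implementation; the original code may differ). Each tree is grown on
# a bootstrap sample and considers a random subset of features (mtry)
# at each split; predictions are made by majority vote across trees.
library(randomForest)

rf <- randomForest(Sport ~ Height + Weight + Age + Sex,
                   data = train, ntree = 500)
rf_pred <- predict(rf, newdata = test)
```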

The fourth and final method used–artificial neural networks–is a bit harder to visualize. Neural networks are sort of a black box, making them difficult to interpret and explain. At a coarse level they are intended to work like neurons in the brain: take some input, and produce output based on whether the input crosses a certain threshold. The neural networks I used have a single hidden layer with 30 hidden nodes (for sport classification) or 50 (for event classification). To get a better feel for how neural networks work, see this three-part series.
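A single-hidden-layer network like this can be fit with the nnet package, for example. The package and tuning settings below are assumptions; only the hidden-layer sizes come from the text.

```r
# Sketch of a single-hidden-layer network with nnet (the package and
# tuning settings are assumptions; only the hidden-layer sizes come
# from the text). A factor response gives a softmax output layer.
library(nnet)

nn <- nnet(Sport ~ Height + Weight + Age + Sex, data = train,
           size = 30,       # hidden nodes (50 for the event task)
           maxit = 500,     # cap on training iterations
           MaxNWts = 2000)  # room for the weights of 27 output classes
nn_pred <- predict(nn, newdata = test, type = "class")
```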

That’s a very quick overview of the four machine learning methods that I applied to classifying Olympians by sport and event. The data and R code are available at the link above. In the next post, scheduled for Friday, I’ll share the results.

Classifying Olympic Athletes by Sport and Event (Part 1)

Note: This post is the first in a three-part series. It describes the motivation for this project and the data used. When parts two and three are posted I will link to them here.

Can you predict which sport or event an Olympian competes in based solely on her height, weight, age, and sex? If so, that would suggest that physical features strongly drive athletes’ relative abilities across sports, and that they pick sports that best leverage their physical predisposition. If not, we might infer that athleticism is a latent trait (like “grit”) that can be applied to the sport of one’s choice.

David Epstein argues in his book, The Sports Gene, that sporting success is largely based on heredity. To support his argument, he describes how elite athletes’ physical features have become more specialized to their sport over time (think Michael Phelps). At a basic level Epstein is correct: generally speaking, males and females differ both at a genetic level and in their physical features.

However, Epstein advanced a stronger claim in an interview (at 29:46) with Russ Roberts:

Roberts: [You argue that] if you simply had the height and weight of an Olympic roster, you could do a pretty good job of guessing what their events are. Is that correct?

Epstein: That’s definitely correct. I don’t think you would get every person accurately, but… I think you would get the vast majority of them correctly. And frankly, you could definitely do it easily if you had them charted on a height-and-weight graph, and I think you could do it for most positions in something like football as well.

I chose to assess Epstein’s claim in a project for a machine learning course at Duke this semester. The data was collected by The Guardian, and includes all participants for the 2012 London Summer Olympics. There was complete data on age, sex, height, and weight for 8,856 participants, excluding dressage (an oddity of the data is that every horse-rider pair was treated as the sole participant in a unique event described by the horse’s name). Olympians participate in one or more events (fairly specific competitions, like a 100m race), which are nested in sports (broader categories such as “Swimming” or “Athletics”).

Athletics is by far the largest sport category (around 20 percent of athletes), so when it was included it dominated the predictions. To get more accurate classifications, I excluded Athletics participants from the sport classification task. This left 6,956 participants in 27 sports, split into a training set of size 3,520 and a test set of size 3,436. The 1,900 Athletics participants were classified into 48 different events, and also split into training (907 observations) and test sets (993 observations). For athletes participating in more than one event, only their first event was used.
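A split like that can be produced with a simple random partition. This sketch uses a hypothetical `athletes` data frame; the project’s actual partition may have been built differently.

```r
# Sketch of a random train/test split (hypothetical `athletes` data
# frame; the actual partition may have been constructed differently).
set.seed(1)
n <- nrow(athletes)
train_idx <- sample(n, size = floor(n / 2))
train <- athletes[train_idx, ]
test  <- athletes[-train_idx, ]
```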

What does an initial look at the data tell us? The features of athletes in some sports (Basketball, Rowing, Weightlifting, and Wrestling) and events (100m hurdles, Hammer throw, High jump, and Javelin) exhibit strong clustering patterns. This makes it relatively easy to guess a participant’s sport or event based on her features. In other sports (Archery, Swimming, Handball, Triathlon) and events (100m race, 400m hurdles, 400m race, and Marathon) there are many overlapping clusters, making classification more difficult.

[Figure: height vs. weight scatterplots by sport and event]

Well-defined (left) and poorly-defined (right) clusters of height and weight by sport.

Well-defined (left) and poorly-defined (right) clusters of height and weight by event.


The next post, scheduled for Wednesday, will describe the machine learning methods I applied to this problem. The results will be presented on Friday.

Two Unusual Papers on Monte Carlo Simulation

For Bayesian inference, Markov Chain Monte Carlo (MCMC) methods were a huge breakthrough. These methods provide a principled way to simulate from a posterior probability distribution, and are useful for integrating over distributions that are computationally intractable. Usually MCMC methods are performed with computers, but I recently read two papers that apply Monte Carlo simulation in interesting ways.
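As a quick reminder of how MCMC works in its usual computational setting, here is a minimal random-walk Metropolis sampler in R. This is my own illustration, not code from either paper, and it targets a standard normal distribution for simplicity.

```r
# Minimal random-walk Metropolis sampler (illustration only; not from
# either paper). The target is a standard normal "posterior".
set.seed(42)
log_target <- function(x) dnorm(x, log = TRUE)

n_iter <- 10000
draws <- numeric(n_iter)
current <- 0
for (i in seq_len(n_iter)) {
  proposal <- current + rnorm(1)  # symmetric proposal
  # Accept with probability min(1, target(proposal) / target(current))
  if (log(runif(1)) < log_target(proposal) - log_target(current)) {
    current <- proposal
  }
  draws[i] <- current
}
mean(draws)  # should be near 0
```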

The first is Markov Chain Monte Carlo with People. MCMC with people is somewhat similar to playing the game of telephone–there is input “data” (think of the starting word in the telephone game) that is transmitted across stages where it can be modified and then output at the end. In the paper the authors construct a task so that human learners approximately follow an MCMC acceptance rule. I have summarized the paper in slightly more detail here.

The second paper is even less conventional: the authors approximate the value of π using a “Mossberg 500 pump-action shotgun as the proposal distribution.” Their simulated value is 3.131, within 0.33% of the true value. As the authors state, “this represents the first attempt at estimating π using such method, thus opening up new perspectives towards computing mathematical constants using everyday tools.” Who said statistics has to be boring?
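The shotgun experiment is a physical twist on the classic dartboard estimator, which is easy to replicate in software: draw uniform points in the unit square and count the fraction that lands inside the quarter circle.

```r
# The classic dartboard version of the same idea: the fraction of
# uniform points inside the quarter circle estimates pi / 4.
set.seed(7)
n <- 1e6
x <- runif(n)
y <- runif(n)
pi_hat <- 4 * mean(x^2 + y^2 <= 1)
pi_hat  # roughly 3.14
```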


Schneier on Data and Power

Data and Power is the tentative title of a new book, forthcoming from Bruce Schneier. Here’s more from the post describing the topic of the book:

Corporations are collecting vast dossiers on our activities on- and off-line — initially to personalize marketing efforts, but increasingly to control their customer relationships. Governments are using surveillance, censorship, and propaganda — both to protect us from harm and to protect their own power. Distributed groups — socially motivated hackers, political dissidents, criminals, communities of interest — are using the Internet to both organize and effect change. And we as individuals are becoming both more powerful and less powerful. We can’t evade surveillance, but we can post videos of police atrocities online, bypassing censors and informing the world. How long we’ll still have those capabilities is unclear….

There’s a fundamental trade-off we need to make as society. Our data is enormously valuable in aggregate, yet it’s incredibly personal. The powerful will continue to demand aggregate data, yet we have to protect its intimate details. Balancing those two conflicting values is difficult, whether it’s medical data, location data, Internet search data, or telephone metadata. But balancing them is what society needs to do, and is almost certainly the fundamental issue of the Information Age.

There’s more at the link, including several other potential titles. The topic will likely interest many readers of this blog, and the book will presumably build on his ideas of inequality and online feudalism, discussed here.

Visualizing the Indian Buffet Process with Shiny

(This is a somewhat more technical post than usual. If you just want the gist, skip to the visualization.)

N customers enter an Indian buffet restaurant, one after another. It has a seemingly endless array of dishes. The first customer fills her plate with a Poisson(α) number of dishes. Each successive customer i tastes the previously sampled dishes in proportion to their popularity (the number of previous customers who have sampled the kth dish, m_k, divided by i). The ith customer then samples a Poisson(α/i) number of new dishes.

That’s the basic idea behind the Indian Buffet Process (IBP). On Monday Eli Bingham and I gave a presentation on the IBP in our machine learning seminar at Duke, taught by Katherine Heller. The IBP is used in Bayesian non-parametrics to put a prior on (equivalence classes of) binary matrices. The matrices usually represent the presence of features (“dishes” above, or the columns of the matrix) in objects (“customers,” or the rows of the matrix). The culinary metaphor is used by analogy to the Chinese Restaurant Process.
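The generative description above translates almost line for line into R. This is a sketch of one draw, not the code behind the app below.

```r
# Sketch of one draw from the IBP (illustration; not the app's code).
rIBP <- function(N, alpha) {
  K <- rpois(1, alpha)                 # first customer's dishes
  Z <- matrix(0, nrow = N, ncol = K)
  if (K > 0) Z[1, ] <- 1
  for (i in 2:N) {
    if (K > 0) {
      m_k <- colSums(Z[1:(i - 1), , drop = FALSE])  # dish popularities
      Z[i, 1:K] <- rbinom(K, 1, m_k / i)  # taste dish k w.p. m_k / i
    }
    new_k <- rpois(1, alpha / i)          # Poisson(alpha / i) new dishes
    if (new_k > 0) {
      Z <- cbind(Z, matrix(0, nrow = N, ncol = new_k))
      Z[i, (K + 1):(K + new_k)] <- 1
      K <- K + new_k
    }
  }
  Z
}

Z <- rIBP(N = 10, alpha = 10)
```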

Although the visualizations in the main paper summarizing the IBP are good, I thought it would be helpful to have an interactive visualization where you could change α and N to see what a random matrix with those parameters looks like. For this I used Shiny, although it would also be fun to do in d3.

One realization of the IBP, with α=10.


In the example above, the first customer (top row) sampled seven dishes. The second customer sampled four of those seven dishes, and then four more dishes that the first customer did not try. The process continues for all 10 customers. (Note that this matrix is not sorted into its left-ordered form. The app also sometimes gives an error if α << N, but I wanted users to be able to choose arbitrary values of N so I have not changed this yet.) You can play with the visualization yourself here.
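Structurally, the app needs only a couple of sliders and a plot. This skeleton, reusing the `rIBP` sketch above, is an assumption about the layout rather than the app’s actual source.

```r
# Skeleton of a Shiny app wrapping rIBP above (assumed structure).
library(shiny)

ui <- fluidPage(
  sliderInput("alpha", "alpha", min = 1, max = 20, value = 10),
  sliderInput("N", "N (customers)", min = 2, max = 50, value = 10),
  plotOutput("ibp")
)

server <- function(input, output) {
  output$ibp <- renderPlot({
    Z <- rIBP(input$N, input$alpha)
    # Draw the binary matrix, first customer in the top row
    image(t(Z[nrow(Z):1, , drop = FALSE]),
          col = c("white", "black"), axes = FALSE)
  })
}

shinyApp(ui, server)
```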

Interactive online visualizations like this can be a helpful teaching tool, and the process of making them can also deepen your own understanding of the underlying model. If you would like to make another visualization of the IBP (or of another machine learning tool that lends itself to graphical representation) I would be happy to share it here. I plan to add the Chinese restaurant process and a Dirichlet process mixture of Gaussians soon. You can find more about creating Shiny apps here.

A Chrome Extension for XKCD Substitutions

This morning’s XKCD had some fun suggestions for replacing key phrases to make news articles more fun:

Regular readers may recall my Doublespeak Chrome extension, which works on the same principle. In short order, I was able to create a new app, XKCDSub, that works the same way: install the extension, and when you click its icon it will open your current page in a new tab with the phrases replaced. Here is an example of the extension in action on Elon Musk’s Wikipedia page:

[Screenshot: XKCDSub applied to Elon Musk’s Wikipedia page]

The code is open source on GitHub. You can find it in the Chrome Web Store here.

Visualizing Political Unrest in Egypt, Syria, and Turkey

The lab of Michael D. Ward et al. now has a blog. The inaugural post describes some of the lab’s ongoing projects that may come up in future entries, including modeling of protests, insurgencies, and rebellions; event prediction (such as IED explosions); and machine learning techniques.

The second post compares two event data sets–GDELT and ICEWS–using recent political unrest in the Middle East as a focal point (more here):

We looked at protest events in Egypt and Turkey in 2011 and 2012 for both data sets, and we also looked at fighting in Syria over the same period…. What did we learn from these limited comparisons? First, we found out firsthand what the GDELT community has been saying: the GDELT data are in BETA and currently have a lot of false positives. This is not optimal for a decision making aid such as ICEWS, in which drill-down to the specific events resulting in new predictions is a requirement. Second, no one has a good ground truth for event data — though we have some ideas on this and are working on a study to implement them. Third, geolocation is a boon. GDELT seems especially good at this, even with a lot of false positives.

The visualization, which I worked on as part of the lab, can be found here. It relies on CartoDB to serve data from GDELT and ICEWS, with some preprocessing done using MySQL and R. The front-end is JavaScript, using a combination of d3 for timelines and Torque for maps.


GDELT (green) and ICEWS (blue) records of protests in Egypt and Turkey and conflict in Syria

If you have questions about the visualizations or the technology behind them, feel free to mention them here or on the lab blog.

Review: RubyMotion iOS Development Essentials

RubyMotion is a continued topic of interest on this blog, and I will likely have more posts about it in the near future. At this stage I am still getting comfortable with iOS development, but I would much rather be doing it in the friendly playground of Ruby than in the Objective-C world. In addition to the RubyMotion book from PragProg, the next resource I would recommend is RubyMotion iOS Development Essentials.*

The book takes a “zero-to-deploy” approach, which is great for beginners trying to get their first app into the App Store. The first few chapters will be redundant for developers who have worked with RubyMotion before, but they provide a helpful introduction to how RM works and the Model-View-Controller paradigm.

For several chapters the book uses the same example application, a restaurant recommender reminiscent of Yelp. Demonstrating code by building up from a simple application is a nice way of presenting the material. By the time readers have worked through these chapters they will have an example app that is more interesting than many of the toy apps in shorter tutorials.

Later chapters will benefit novice and experienced developers alike, because they fill a gap in the RubyMotion literature. Many tutorials overlook the process of testing RM code, and testing iOS in general can be challenging. The testing chapter of this book goes over unit testing, functional testing, and tests that rely on device events such as gestures.

My favorite chapter in the book was chapter 6, which goes over device capabilities. At 46 pages this is the longest chapter in the book, covering GPS, gestures, Core Data, using the Address Book, and more. I especially enjoyed working through the section on accessing the camera and Photo Library. This is difficult to test on the simulator, which (obviously) has no built-in camera (a limitation shared by some iOS devices, including certain iPod Touch models), but the example app covers how to handle this gracefully.

Stylistically, it can be a challenge to lay out a book that uses iOS API jargon like UIImagePickerControllerSourceTypePhotoLibrary. Some readers may gripe about the authors’ choice of two-space indenting, but that is my preference, so it did not bother me. One addition I would have liked is extra formatting for the code, using either colors (for the e-book) or bolding (for the print version) to distinguish function names and help the reader keep their place. The apps themselves rely mainly on iOS defaults. This is common in tutorials, and it also helps them look natural in iOS 7. Most of the time I was working through the book I used the iOS 6.1 simulator, but it was no problem to upgrade to iOS 7.

As a whole this book is a thorough introduction to RubyMotion development. It has several key features that are missing from other RubyMotion tutorials, including an emphasis on testing code. This book makes a great resource for new RubyMotion developers, or developers who want to use more of the device capabilities.

*Note: For this review I relied on the e-book version, which was provided as a review copy by Packt Publishing.