Visualizing the Indian Buffet Process with Shiny

(This is a somewhat more technical post than usual. If you just want the gist, skip to the visualization.)

N customers enter an Indian buffet restaurant, one after another. It has a seemingly endless array of dishes. The first customer fills her plate with a Poisson(α) number of dishes. Each successive customer i tastes the previously sampled dishes in proportion to their popularity (the number of previous customers who have sampled the kth dish, m_k, divided by i). The ith customer then samples a Poisson(α) number of new dishes.

That’s the basic idea behind the Indian Buffet Process (IBP). On Monday Eli Bingham and I gave a presentation on the IBP in our machine learning seminar at Duke, taught by Katherine Heller. The IBP is used in Bayesian non-parametrics to put a prior on (exchangeability classes of) binary matrices. The matrices usually represent the presence of features (“dishes” above, or the columns of the matrix) in objects (“customers,” or the rows of the matrix). The culinary metaphor is used by analogy to the Chinese Restaurant Process.

Although the visualizations in the main paper summarizing the IBP are good, I thought it would be helpful to have an interactive visualization where you could change α and N to see how what a random matrix with those parameters looks like. For this I used Shiny, although it would also be fun to do in d3.

One realization of the IBP, with α=10.

One realization of the IBP, with α=10.

In the example above, the first customer (top row) sampled seven dishes. The second customer sampled four of those seven dishes, and then four more dishes that the first customer did not try. The process continues for all 10 customers. (Note that this matrix is not sorted into its left-ordered-form. It also sometimes gives an error if α << N, but I wanted users to be able to choose arbitrary values of N so I have not changed this yet.) You can play with the visualization yourself here.

Interactive online visualizations like this can be a helpful teaching tool, and the process of making them can also improve your own understanding of the process. If you would like to make another visualization of the IBP (or another machine learning tool that lends itself to graphical representation) I would be happy to share it here. I plan to add the Chinese restaurant process and a Dirichlet process mixture of Gaussians soon. You can find more about creating Shiny apps here.

Constitutional Forks Revisited

Around this time last year, we discussed the idea of a constitutional “fork” that occurred with the founding of the Confederate States of America. That post briefly explains how forks work in open source software and how the Confederates used the US Constitution as the basis for their own, with deliberate and meaningful differences. Putting the two documents on Github allowed us to compare their differences visually and confirm our suspicions that many of them were related to issues of states’ rights and slavery.

Caleb McDaniel, a historian at Rice who undoubtedly has a much deeper and more thorough knowledge of the period, conducted a similar exercise and also posted his results on Github. He was faced with similar decisions of where to obtain the source text and which differences to retain as meaningful (for example, he left in section numbers where I did not). My method identifies 130 additions and 119 deletions when transitioning between the USA and CSA constitutions, whereas the stats for Caleb’s repo show 382 additions and 370 deletions.

What should we draw from these projects? In Caleb’s words:

My decisions make this project an interpretive act. You are welcome to inspect the changes more closely by looking at the commit histories for the individual Constitution files, which show the initial text as I got it from Avalon as well as the changes that I made.

You can take a look at both projects and conduct a difference-in-differences exploration of your own. More generally, these projects show the need for tools to visualize textual analyses, as well as the power of technology to enhance understanding of historical and political acts. Caleb’s readme file has great resources for learning more about this topic including the conversation that led him to this project, a New York Times interactive feature on the topic, and more.

Playing Chicken with Your Calendar

The ever-interesting Brendan Nelson on meeting chicken:

You have a regular meeting in your calendar. It’s with just one other person. Sometimes you have things to talk to them about and sometimes you don’t. But as long as your calendar says you both have to go, you will both go.

The day of the meeting comes round. There are lots of things that need to be done that day. You look at that meeting sitting obstinately in your calendar and think how useful it would be to get that time back.

Inspiration strikes: why not cancel the meeting? A couple of mouse clicks, an automatic notification sent out, a joyously blank calendar. It seems so easy.

But you can’t bring yourself to do it, to cancel a meeting at such short notice. It would make you look disorganised, unprepared. And what about the other person?

More at the link.

Don’t Forget Your Forever Stamps

The price of a first-class US stamp is set to increase from 46 to 49 cents on January 26. Like Cosmo Kramer’s Michigan bottle redemption plan (see below), Allison Schrager and Ritchie King ran the numbers on whether it would be possible to provide from Forever Stamp arbitrage.

Could the scheme make money? Maybe–if you get the timing right and pay low interest on capital:

Assuming we sell all 10 million stamps for the bulk discount price of $0.475 each, our profit will be $150,000. Subtract out the $399 for the distributor database. Let’s also assume we spent the $3,500 for Check Stand Program plus, say, $300 to make the 100 displays for advertising in stores. That gives us $145,801.

If we do manage to shift the stamps in a month, the interest on our debt will be $29,000. That brings our profits to $116,801. Then we’ll return the equity to our shareholders, along with 50% of the profits.

That leaves us with the other 50%: $58,400.50. If you look at that as a profit on the $4.6 million initial outlay, it’s not very much: less than 1.3%. But remember, all that outlay was leveraged. So if you look at it as a return on our investment—$33.25 for shipping—it’s 175,541%.

Github for Government

What happens when you combine open source software, open data, and open government? For the city of Munich, the switch to open source software has been a big success:

In one of the premier open source software deployments in Europe, the city migrated from Windows NT to LiMux, its own Linux distribution. LiMux incorporates a fully open source desktop infrastructure. The city also decided to use the Open Document Format (ODF) as a standard, instead of proprietary options.

As of November last year, the city saved more than €11.7 million because of the switch. More recent figures were not immediately available, but cost savings were not the only goal of the operation. It was also done to be less dependent on manufacturers, product cycles and proprietary OSes, the council said.

We’ve talked before about how more city governments could follow the open data, open government initiatives of NYC, using tech to benefit citizens rather than (only) creating initiatives to attract tech companies to the area. This shift in emphasis, toward harnessing the power of technology for widespread gains in happiness, is likely to become even more important following recent protests against tech employees in the Bay Area.

Open data and open government will take the principles of open source and use them to make an even bigger social and political impact. One tool from open source that can be adapted for use by these newer movements is Github. We will continue to follow these trends here, and if you are interested in this trend you can also check out Github and Government for more success stories.

What Can We Learn from Games?

ImageThis holiday season I enjoyed giving, receiving, and playing several new card and board games with friends and family. These included classics such as cribbage, strategy games like Dominion and Power Grid, and the whimsical Munchkin.

Can video and board games teach us more than just strategy? What if games could teach us not to be better thinkers, but just to be… better? A while ago we discussed how monopoly was originally designed as a learning experience to promote cooperation. Lately I have learned of two other such games in a growing genre and wanted to share them here.

The first is Depression Quest by Zoe Quinn (via Jeff Atwood):

Depression Quest is an interactive fiction game where you play as someone living with depression. You are given a series of everyday life events and have to attempt to manage your illness, relationships, job, and possible treatment. This game aims to show other sufferers of depression that they are not alone in their feelings, and to illustrate to people who may not understand the illness the depths of what it can do to people.

The second is Train by Brenda Romero (via Marcus Montano) described here with spoilers:

In the game, the players read typewritten instructions. The game board is a set of train tracks with box cars, sitting on top of a window pane with broken glass. There are little yellow pegs that represent people, and the player’s job is to efficiently load those people onto the trains. A typewriter sits on one side of the board.

The game takes anywhere from a minute to two hours to play, depending on when the players make a very important discovery. At some point, they turn over a card that has a destination for the train. It says Auschwitz. At that point, for anyone who knows their history, it dawns on the player that they have been loading Jews onto box cars so they can be shipped to a World War II concentration camp and be killed in the gas showers or burned in the ovens.

The key emotion that Romero said she wanted the player to feel was “complicity.”

“People blindly follow rules,” she said. “Will they blindly follow rules that come out of a Nazi typewriter?”

I have tried creating my own board games in the past, and this gives me renewed interest and a higher standard. What is the most thought-provoking moment you have experienced playing games?

Uncle Bob on Public Policy and Software Professionalism

Software developers need to develop their own professional standard, or politicians will do it for them. That’s what “Uncle” Bob Martin argues in this interview starting about 28:00:

Healthcare.gov was awful. That’s a case where a software failure interfered with a public policy. Whether you agree with that policy or not that should scare the hell out of you, because the next public policy may be one much more important and if our software can’t cope with it we could be in a really deep, deep hole.

At some point or another, some software team is going to screw up so badly that there is a disaster of tremendous loss of life. At that point the politicians of the world will decide they have to do something about it. If we are not there with a set of minimum standards that we follow, practices that we follow, if we can’t convince those politicians that we have been behaving professionally and that this was an accident–if we can’t convince them that we weren’t being negligent–then they’ll have no choice but to regulate us. They’ll pass laws about which languages we use, what platforms we can program on, what books we have to read, and so on. It will not be a good outcome. I don’t want to be a civil servant.

The Economy That Is Stanford

Five of the six most-visited websites in the world are here, in ranked order: Facebook, Google, YouTube (which Google owns), Yahoo! and Wikipedia. (Number five is a Chinese-language site.) If corporations founded by Stanford alumni were to form an independent nation, it would be the tenth largest economy in the world, with an annual revenue of $2.7 trillion, as some professors at that university recently calculated. Another new report says: ‘If the internet was a country, its gross domestic product would eclipse all others but four within four years.’

That’s from this London Review of Books piece by Rebecca Solnit. The October, 2012, research report on which the claim is based is here, based on survey data. Solnit’s piece is interesting throughout, including a discussion of parallels and differences between the tech boom and the Gold Rush.

Two Great Talks on Government and Technology

If you are getting ready to travel next week, you might want to have a couple of good talks/podcasts handy for the trip. Here are two that I enjoyed, on the topic of government and technology.

The first is about how technology can help governments. Ben Orenstein of “Giant Robots Smashing Into Other Giant Robots” discusses Code for America with Catherine Bracy. Catherine recounts some ups and downs of CfA’s partnerships with cities throughout America and internationally. CfA fellows commit a year to help local governments with challenges amenable to technology. One great example that the podcast discusses is a tool for parents in Boston to see which schools they could send their kids to when the city switched from location-based school assignment to allowing students to attend schools throughout the city. (Incidentally, the school matching algorithm that Boston used was designed by some professors in economics at Duke, who drew on work for which Roth and Shapley won the Nobel Prize.)

The second talk offers another point of view on techno-politics: when government abuses technology. Steve Klabnik‘s “No Secrets Allowed” talk from Golden Gate Ruby Conference discusses recent revelations regarding the NSA and privacy. In particular he explains why “I have nothing to hide” is not an appropriate response. The talk is not entirely hopeless, and includes recommendations such as using Tor. The Ruby Rogues also had a roundtable discussing Klabnik’s presentation, which you can find here.

Other recommendations are welcome.

Political Forecasting and the Use of Baseline Rates

As Joe Blitzstein likes to say, “Thinking conditionally is a condition for thinking.” Humans are not naturally good at this skill. Consider the following example: Kelly is interested in books and keeping things organized. She loves telling stories and attending book clubs. Is it more likely that Kelly is a bestselling novelist or an accountant?

Many of the “facts” about Kelly in that story might lead you to answer that she is a novelist. Only one–her sense of organization–might have pointed you toward an accountant. But think about the overall probability of each career. Very few bookworms become successful novelists, and there are many more accountants than (successful) authors in the modern workforce. Conditioning on the baseline rate helps make a more accurate decision.

I make a similar point–this time applied to political forecasting–in a recent blog post for the blog of Mike Ward’s lab (of which I am a member):

One piece of advice that Good Judgment forecasters are often reminded of is to use the baseline rate of an event as a starting point for their forecast. For example, insurgencies are a very rare event on the whole. For the period January, 2001 to August, 2013, insurgencies occurred in less than 10 percent of country-months in the ICEWS data set.

From this baseline, we can then incorporate information about the specific countries at hand and their recent history… Mozambique has not experienced an insurgency for the entire period of the ICEWS dataset. On the other hand, Chad had an insurgency that ended in December, 2003, and another that extended from November, 2005, to April, 2010. For the duration of the ICEWS data set, Chad has experienced an insurgency 59 percent of the time. This suggests that our predicted probability of insurgency in Chad should be higher than for Mozambique.

I started writing that post before rebels in Mozambique broke their treaty with the government. Maybe I spoke too soon, but the larger point is that baselines are the starting point–not the final product–of any successful forecast.

Having more data is useful, as long as it contributes more signal than noise. That’s what ICEWS aims to do, and I consider it a useful addition to the toolbox of forecasters participating in the Good Judgment Project. For more on this collaboration, as well as a map of insurgency rates around the globe as measured by ICEWS, see the aforementioned post here.