Visualizing the Indian Buffet Process with Shiny

(This is a somewhat more technical post than usual. If you just want the gist, skip to the visualization.)

N customers enter an Indian buffet restaurant, one after another. It has a seemingly endless array of dishes. The first customer fills her plate with a Poisson(α) number of dishes. Each successive customer i tastes each previously sampled dish k with probability equal to its popularity: m_k, the number of previous customers who have sampled the kth dish, divided by i. The ith customer then samples a Poisson(α/i) number of new dishes.
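The buffet story translates almost line for line into a sampler. What follows is only a minimal sketch, not the code behind the Shiny app: it assumes the standard formulation in which customer i revisits dish k with probability m_k/i and then tries Poisson(α/i) new dishes, and the names sample_poisson and sample_ibp are made up for illustration.

```python
import math
import random


def sample_poisson(rate, rng):
    """Draw from Poisson(rate) with Knuth's multiplication method (fine for small rates)."""
    threshold = math.exp(-rate)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def sample_ibp(n_customers, alpha, seed=None):
    """Return one binary customer-by-dish matrix (a list of rows) from the buffet scheme."""
    rng = random.Random(seed)
    dish_counts = []  # m_k: number of customers so far who took dish k
    rows = []
    for i in range(1, n_customers + 1):
        # Revisit each existing dish with probability m_k / i.
        row = [1 if rng.random() < m_k / i else 0 for m_k in dish_counts]
        for k, taken in enumerate(row):
            dish_counts[k] += taken
        # Then try Poisson(alpha / i) brand-new dishes.
        n_new = sample_poisson(alpha / i, rng)
        row.extend([1] * n_new)
        dish_counts.extend([1] * n_new)
        rows.append(row)
    # Pad earlier rows with zeros so the matrix is rectangular.
    width = len(dish_counts)
    return [row + [0] * (width - len(row)) for row in rows]


if __name__ == "__main__":
    for row in sample_ibp(n_customers=10, alpha=10, seed=1):
        print("".join("#" if cell else "." for cell in row))
```

Each printed line is one customer and each column one dish. Like the visualization, the sketch does not sort the matrix into left-ordered form.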

That’s the basic idea behind the Indian Buffet Process (IBP). On Monday Eli Bingham and I gave a presentation on the IBP in our machine learning seminar at Duke, taught by Katherine Heller. The IBP is used in Bayesian non-parametrics to put a prior on (exchangeability classes of) binary matrices. The matrices usually represent the presence of features (“dishes” above, or the columns of the matrix) in objects (“customers,” or the rows of the matrix). The culinary metaphor is used by analogy to the Chinese Restaurant Process.

Although the visualizations in the main paper summarizing the IBP are good, I thought it would be helpful to have an interactive visualization where you could change α and N to see what a random matrix with those parameters looks like. For this I used Shiny, although it would also be fun to do in d3.

One realization of the IBP, with α=10.


In the example above, the first customer (top row) sampled seven dishes. The second customer sampled four of those seven dishes, and then four more dishes that the first customer did not try. The process continues for all 10 customers. (Note that this matrix is not sorted into its left-ordered form. The app also sometimes gives an error if α << N, but I wanted users to be able to choose arbitrary values of N so I have not changed this yet.) You can play with the visualization yourself here.

Interactive online visualizations like this can be a helpful teaching tool, and building them can also improve your own understanding of the underlying process. If you would like to make another visualization of the IBP (or another machine learning tool that lends itself to graphical representation) I would be happy to share it here. I plan to add the Chinese restaurant process and a Dirichlet process mixture of Gaussians soon. You can find more about creating Shiny apps here.

Visualizing Political Unrest in Egypt, Syria, and Turkey

The lab of Michael D. Ward et al. now has a blog. The inaugural post describes some of the lab’s ongoing projects that may come up in future entries, including modeling of protests, insurgencies, and rebellions; event prediction (such as IED explosions); and machine learning techniques.

The second post compares two event data sets–GDELT and ICEWS–using recent political unrest in the Middle East as a focal point (more here):

We looked at protest events in Egypt and Turkey in 2011 and 2012 for both data sets, and we also looked at fighting in Syria over the same period…. What did we learn from these limited comparisons? First, we found out first hand what the GDELT community has been saying: the GDELT data are in BETA and currently have a lot of false positives. This is not optimal for a decision making aid such as ICEWS, in which drill-down to the specific events resulting in new predictions is a requirement. Second, no one has a good ground truth for event data – though we have some ideas on this and are working on a study to implement them. Third, geolocation is a boon. GDELT seems especially good at this, even with a lot of false positives.

The visualization, which I worked on as part of the lab, can be found here. It relies on CartoDB to serve data from GDELT and ICEWS, with some preprocessing done using MySQL and R. The front-end is JavaScript, using a combination of d3 for timelines and Torque for maps.


GDELT (green) and ICEWS (blue) records of protests in Egypt and Turkey and conflict in Syria

If you have questions about the visualizations or the technology behind them, feel free to mention them here or on the lab blog.

Grad Student Gift Ideas

My sister is starting a graduate program this fall, so I wanted to put together a “grad school survival kit” gift basket for her. When I was looking, though, most search results for things like that were put together by gift basket companies and a large number of them include junk food as filler. While junk food can be a great stress reliever, I would not recommend making that the bulk of your gift to a grad student. Instead, consider some of the following gifts that range from practical to fun:


Recommended Packages for R 3.x

With the recent release of R 3.0 (OS X) and 3.1 (Windows), I found myself in need of a whole host of packages for data analysis. Rather than discover each one I needed in the middle of doing real work, I thought it would be helpful to have a script with a list of essentials to speed up the process. This became even more essential when I also had to install R on a couple of machines in our department’s new offices.

Thankfully my colleague Shahryar Minhas had a similar idea and had already started a script, which I adapted and share here with his permission. The script is also on Github so if you have additions that you find essential on a new R install feel free to recommend them.

PACKAGES = c("Amelia",
install.packages('tikzDevice', repos='')

Project Design as Reproducibility Aid

From the Political Science Replication blog:

When reproducing published work, I’m often annoyed that methods and models are not described in detail. Even if it’s my own project, I sometimes struggle to reconstruct everything after I took a longer break from a project. An article by post-docs Rich FitzJohn and Daniel Falster shows how to set up a project structure that makes your work reproducible.

To get that “mix” into a reproducible format, post-docs Rich FitzJohn and Daniel Falster from Macquarie University in Australia suggest to use the same template for all projects. Their goal is to ensure integrity of their data, portability of the project, and to make it easier to reproduce your own work later. This can work in R, but in any other software as well.

Here’s their gist. My post from late last year suggests a similar structure. PSR and I were both informed about ProjectTemplate based on these posts–check it out here.

Managing Memory and Load Times in R and Python

Once I know I won’t need a file again, it’s gone. (Regular back-ups with Time Machine have saved me from my own excessive zeal at least once.) Similar economy applies to runtime: My primary computing device is my laptop, and I’m often too lazy to fire up a cloud instance unless the job would take more than a day.

Working with GDELT data for the last few weeks I’ve had to be a bit less conservative than usual. Habits are hard to break, though, so I found myself looking for a way to

  1. keep all the data on my hard-drive, and
  2. read it into memory quickly in R and/or Python.

The .zip files you can obtain from the GDELT site accomplish (1) but not (2). A .rda binary helps with part of (2) but has the downside of being a binary file that I might not be able to open at some indeterminate point in the future–violating (1). And a memory-hogging CSV that also loads slowly is the worst option of all.

So what satisficing solution did I reach? Saving gzipped files (.gz). Both R and Python can read these files directly (R code shown below; in Python, the compression option for read_csv in pandas is one route). It’s definitely smaller: the 1979 GDELT historical backfile compresses from 115.3MB to 14.3MB (an eighth of its former size). Reading directly into R from a .gz file has been available since at least version 2.10.

Is it faster? See for yourself:

> system.time(read.csv('1979.csv', sep='\t', header=F, flush=T))
   user  system elapsed
 48.930   1.126  50.918
> system.time(read.csv('1979.csv.gz', sep='\t', header=F, flush=T))
   user  system elapsed
 23.202   0.849  24.064
> system.time(load('gd1979.rda'))
   user  system elapsed
  5.939   0.182   7.577

Compressing and decompressing .gz files is straightforward too. In the OS X Terminal, just type gzip filename or gunzip filename, respectively.

Reading the gzipped file takes less than half as long as the unzipped version. It’s still nowhere near as fast as loading the rda binary, but I don’t have to worry about file readability for many years to come given the popularity of *nix operating systems. Consider using .gz files for easy memory management and quick loading in R and Python.
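On the Python side, the standard library's gzip module can read such a file without unpacking it first. This is a hedged sketch rather than the code I timed: the file name demo.csv.gz and the rows in it are made-up placeholders, not real GDELT records, and the two helper functions are named only for illustration.

```python
import csv
import gzip
import os
import tempfile


def write_gzipped_tsv(path, rows):
    """Write rows to a gzip-compressed, tab-delimited text file."""
    with gzip.open(path, "wt", newline="", encoding="utf-8") as handle:
        csv.writer(handle, delimiter="\t").writerows(rows)


def read_gzipped_tsv(path):
    """Read a gzip-compressed, tab-delimited file into a list of rows."""
    # Mode "rt" decompresses on the fly and yields text lines.
    with gzip.open(path, "rt", newline="", encoding="utf-8") as handle:
        return list(csv.reader(handle, delimiter="\t"))


if __name__ == "__main__":
    # Placeholder rows standing in for a tab-delimited event file.
    demo_rows = [["19790101", "USA", "EGY"], ["19790102", "TUR", "SYR"]]
    path = os.path.join(tempfile.mkdtemp(), "demo.csv.gz")
    write_gzipped_tsv(path, demo_rows)
    print(read_gzipped_tsv(path))
```

With pandas installed, something like pandas.read_csv(path, sep="\t", header=None, compression="gzip") should do the same job in one call.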

JavaScript Politics

In a recent conversation on Twitter, Christopher Zorn said that Stata is fascism, R is anarchism, and SAS is masochism. While only one of these is plausibly a programming language, it’s an interesting political analogy. We’ve discussed the politics of the Ruby language before.

Today I wanted to share a speaker deck by Angus Croll on the politics of JavaScript. He describes periods of anarchy (1995-2004), revolution (2004-2006), and coming of age (2007-2010). We’re currently in “the itch” (2011-2013). There are a number of other political dimensions in the slides as well. Click the image below to see the deck in full.


If anyone knows of a video of the presentation, I’d love to see it. Croll also wrote an entertaining article with JavaScript code in the style of famous authors like Hemingway, Dickens, and Shakespeare.

Converting and Standardizing Country Names/Codes in R

We have run into this issue before: you have n ≥ 2 datasets with 1 < k ≤ n different coding schemes for the cross-sectional unit. You need to get them all standardized so you can merge the data and increase the measurement error, control for a reviewer’s favorite variable, or simply run some models.

Last week I was about to spend some time merging alphanumeric ISO codes with their COW counterparts, when I ran across the new countrycode package in R.* The package uses regular expressions to convert between the following supported formats:

  • Correlates of War character
  • Correlates of War numeric
  • ISO3 character
  • ISO3 numeric
  • ISO2 character
  • IMF numeric
  • FIPS 10-4
  • FAO numeric
  • United Nations numeric
  • World Bank character
  • official English short country names (ISO)
  • continent
  • region
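To give a feel for the approach, here is a rough Python sketch of regex-based conversion. It is not the countrycode package or its API: COUNTRY_TABLE is a tiny hand-written table covering six countries, the patterns are deliberately crude, and the convert function with its "name", "iso3c", and "cowc" labels is invented for this example.

```python
import re

# Hypothetical lookup table: one regex per country plus two code schemes.
COUNTRY_TABLE = [
    {"pattern": r"united states|america", "iso3c": "USA", "cowc": "USA"},
    {"pattern": r"united kingdom|great britain", "iso3c": "GBR", "cowc": "UKG"},
    {"pattern": r"germany", "iso3c": "DEU", "cowc": "GMY"},
    {"pattern": r"egypt", "iso3c": "EGY", "cowc": "EGY"},
    {"pattern": r"turkey", "iso3c": "TUR", "cowc": "TUR"},
    {"pattern": r"syria", "iso3c": "SYR", "cowc": "SYR"},
]


def convert(value, origin, destination):
    """Translate value from the origin scheme to the destination scheme.

    origin is "name" (free text matched by regex) or a code column;
    destination must be a code column. Returns None when nothing matches.
    """
    for entry in COUNTRY_TABLE:
        if origin == "name":
            matched = re.search(entry["pattern"], value, flags=re.IGNORECASE)
        else:
            matched = entry[origin] == value.upper()
        if matched:
            return entry[destination]
    return None


if __name__ == "__main__":
    print(convert("GBR", "iso3c", "cowc"))
    print(convert("Syrian Arab Republic", "name", "iso3c"))
```

The regular expressions are what let messy spellings such as "Syrian Arab Republic" and "Syria" land on the same code. A real table needs far more careful patterns than these.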

The author is Vincent Arel-Bundock, a doctoral student in comparative politics at Michigan. Thanks Vincent!


* New here meaning I didn’t know about it before and its documentation is dated Jan. 20, 2013.

Automatically Setup New R and LaTeX Projects

You have a finite number of keystrokes in your life. Automating repetitive steps is one of the great benefits of knowing a bit about coding, even if the code is just a simple shell script. That’s why I set up an example project on Github using a Makefile and sample article (inspired by Rob Hyndman). This post explains the structure of that project and how to modify it for your own purposes.

Running the Makefile with the example article in an otherwise-empty project directory will create:

  • a setup.R file for clearing the workspace, setting paths, and loading libraries
  • a data directory for storing files (csv, rda, etc)
  • a drafts directory for LaTeX, including a generic starter article
  • a graphics directory for storing plots and figures to include in your article
  • an rcode directory for your R scripts

It also supplies some starter code for the setup.R file in the main directory and a start.R file in the rcode directory. This takes the current user and sets relative paths to the project directories with simple variable references in R. For example, after running setup.R in your analysis file you can switch to the data directory with setwd(pathData), then create a plot and save it after running setwd(pathGraphics). Because of the way the setup.R file works, you could have multiple users working on the same project and not need to change any of the other scripts.

If you want to change this structure, there are two main ways to do it. You can add more (or fewer) directories by modifying the mkdir lines in the Makefile. You can add (or remove) text in the R files by changing the echo lines.
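If make is not available, the same layout can be approximated in a few lines of Python. This is an illustrative sketch and not a translation of the Makefile in the Github project: the create_project function is hypothetical, and the setup.R text it writes is a simplified stand-in (pathMain is an invented variable; pathData and pathGraphics follow the names used above).

```python
import tempfile
from pathlib import Path

# Directory names taken from the list above.
PROJECT_DIRS = ["data", "drafts", "graphics", "rcode"]

# Simplified stand-in for the starter setup.R (not the file the Makefile writes).
SETUP_R = """\
rm(list = ls())
pathMain <- getwd()
pathData <- file.path(pathMain, "data")
pathGraphics <- file.path(pathMain, "graphics")
"""


def create_project(root):
    """Create the project directories and a starter setup.R under root."""
    root = Path(root)
    for name in PROJECT_DIRS:
        (root / name).mkdir(parents=True, exist_ok=True)
    setup_file = root / "setup.R"
    if not setup_file.exists():  # never overwrite an edited setup.R
        setup_file.write_text(SETUP_R)
    return sorted(path.name for path in root.iterdir())


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as scratch:
        print(create_project(scratch))
```

Adding or removing a directory means editing PROJECT_DIRS, much as you would edit the mkdir lines in the Makefile.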

If you decide to make major customizations to this file or already have your own structure for projects like this, leave a link in the comments for other readers.

H/T to Josh Cutler for encouraging me to write a post about my project structure and automation efforts. As a final note, this is the second New Year’s Eve in a row that I have posted a tech tip. Maybe it will become a YSPR tradition!

Update: If you’re interested in your own automated project setup, check out ProjectTemplate by John Myles White. Thanks to Trey Causey on Twitter and Zachary Jones in the comments for sharing this.

Afghanistan Casualties Over Time and Space

The data comes from the Defense Casualty Analysis System for Operation Enduring Freedom. Here it is over time:

Notice the seasonality of deaths in Afghanistan, likely due to the harsh winters. Here is the same data plotted across space (service member home towns):

Not surprisingly, hometowns of OEF casualties are similar to those of service members killed in Iraq.