# Visualizing Political Unrest in Egypt, Syria, and Turkey

The lab of Michael D. Ward et al. now has a blog. The inaugural post describes some of the lab’s ongoing projects that may come up in future entries, including modeling of protests, insurgencies, and rebellions; event prediction (such as IED explosions); and machine learning techniques.

The second post compares two event data sets, GDELT and ICEWS, using recent political unrest in the Middle East as a focal point (more here):

We looked at protest events in Egypt and Turkey in 2011 and 2012 for both data sets, and we also looked at fighting in Syria over the same period…. What did we learn from these limited comparisons? First, we found out firsthand what the GDELT community has been saying: the GDELT data are in BETA and currently have a lot of false positives. This is not optimal for a decision-making aid such as ICEWS, in which drill-down to the specific events resulting in new predictions is a requirement. Second, no one has a good ground truth for event data, though we have some ideas on this and are working on a study to implement them. Third, geolocation is a boon. GDELT seems especially good at this, even with a lot of false positives.

The visualization, which I worked on as part of the lab, can be found here. It relies on CartoDB to serve data from GDELT and ICEWS, with some preprocessing done using MySQL and R. The front-end is JavaScript, using a combination of d3 for timelines and Torque for maps.

GDELT (green) and ICEWS (blue) records of protests in Egypt and Turkey and conflict in Syria

If you have questions about the visualizations or the technology behind them, feel free to mention them here or on the lab blog.

# Recommended Packages for R 3.x

With the recent release of R 3.0 (OS X) and 3.1 (Windows), I found myself in need of a whole host of packages for data analysis. Rather than discover each one I needed in the middle of doing real work, I thought it would be helpful to have a script with a list of essentials to speed up the process. This became even more important when I also had to install R on a couple of machines in our department’s new offices.

Thankfully my colleague Shahryar Minhas had a similar idea and had already started a script, which I adapted and share here with his permission. The script is also on GitHub, so if you have additions that you find essential on a new R install, feel free to recommend them.

```r
PACKAGES <- c(
  "Amelia", "apsrtable", "arm", "car", "countrycode", "cshapes",
  "doBy", "filehash", "foreign", "gdata", "ggplot2", "gridExtra",
  "gtools", "Hmisc", "lme4", "lmtest", "maptools", "MASS",
  "Matrix", "mice", "mvtnorm", "plm", "plyr", "pscl", "qcc",
  "RColorBrewer", "reshape", "sandwich", "sbgcop", "scales",
  "sp", "xlsx", "xtable"
)
install.packages(PACKAGES)
install.packages("tikzDevice", repos = "http://r-forge.r-project.org")
```

# Project Design as Reproducibility Aid

From the Political Science Replication blog:

When reproducing published work, I’m often annoyed that methods and models are not described in detail. Even with my own projects, I sometimes struggle to reconstruct everything after a longer break. An article by post-docs Rich FitzJohn and Daniel Falster shows how to set up a project structure that makes your work reproducible.

To get that “mix” into a reproducible format, post-docs Rich FitzJohn and Daniel Falster from Macquarie University in Australia suggest using the same template for all projects. Their goal is to ensure the integrity of their data, keep projects portable, and make it easier to reproduce your own work later. This works in R, but in any other software as well.

Here’s their gist. My post from late last year suggests a similar structure. PSR and I were both informed about ProjectTemplate based on these posts; check it out here.

# Managing Memory and Load Times in R and Python

Once I know I won’t need a file again, it’s gone. (Regular back-ups with Time Machine have saved me from my own excessive zeal at least once.) Similar economy applies to runtime: My primary computing device is my laptop, and I’m often too lazy to fire up a cloud instance unless the job would take more than a day.

Working with GDELT data for the last few weeks I’ve had to be a bit less conservative than usual. Habits are hard to break, though, so I found myself looking for a way to

1. keep all the data on my hard-drive, and
2. read it into memory quickly in R and/or Python.

The .zip files you can obtain from the GDELT site accomplish (1) but not (2). A .rda binary helps with (2) but has the downside of being a binary format that I might not be able to open at some indeterminate point in the future, violating (1). And a memory-hogging CSV that also loads slowly is the worst option of all.

So what satisficing solution did I reach? Saving gzipped files (.gz). Both R and Python can read these files directly (R code shown below; in Python use gzip.open or the compression option for read_csv in pandas). The files are definitely smaller: the 1979 GDELT historical backfile compresses from 115.3MB to 14.3MB, about an eighth of its former size. Reading directly into R from a .gz file has been available since at least version 2.10.

Is it faster? See for yourself:

```r
> system.time(read.csv('1979.csv', sep='\t', header=F, flush=T, as.is=T))
   user  system elapsed
 48.930   1.126  50.918
> system.time(read.csv('1979.csv.gz', sep='\t', header=F, flush=T, as.is=T))
   user  system elapsed
 23.202   0.849  24.064
> system.time(load('1979.rda'))
   user  system elapsed
  5.939   0.182   7.577
```

Compressing and decompressing .gz files is straightforward too. In the OS X Terminal, just type gzip filename or gunzip filename, respectively.
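In Python, the same round trip can be done with the standard library’s gzip module. A minimal sketch, using a small made-up tab-separated file rather than an actual GDELT extract:

```python
import csv
import gzip

# Hypothetical event records standing in for a GDELT-style extract.
rows = [["20130101", "EGY", "PROTEST"],
        ["20130102", "SYR", "FIGHT"]]

# Write the rows directly to a gzip-compressed file ("wt" = text mode).
with gzip.open("events.tsv.gz", "wt", newline="") as f:
    writer = csv.writer(f, delimiter="\t")
    writer.writerows(rows)

# Read them back without ever materializing an uncompressed copy on disk.
with gzip.open("events.tsv.gz", "rt", newline="") as f:
    recovered = list(csv.reader(f, delimiter="\t"))

print(recovered[0][1])  # first row's country field: EGY
```

With pandas you can skip the explicit gzip handling entirely: read_csv infers compression from the .gz extension by default.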

Reading the gzipped file takes less than half as long as the unzipped version. It’s still nowhere near as fast as loading the .rda binary, but given the popularity of *nix operating systems I won’t have to worry about file readability for many years to come. Consider using .gz files for easy memory management and quick loading in R and Python.

# JavaScript Politics

In a recent conversation on Twitter, Christopher Zorn said that Stata is fascism, R is anarchism, and SAS is masochism. While only one of these is plausibly a programming language, it’s an interesting political analogy. We’ve discussed the politics of the Ruby language before.

Today I wanted to share a speaker deck by Angus Croll on the politics of JavaScript. He describes periods of anarchy (1995–2004), revolution (2004–2006), and coming of age (2007–2010). We’re currently in “the itch” (2011–2013). There are a number of other political dimensions in the slides as well. Click the image below to see the deck in full.

If anyone knows of a video of the presentation, I’d love to see it. Croll also wrote an entertaining article with JavaScript code in the style of famous authors like Hemingway, Dickens, and Shakespeare.

# Converting and Standardizing Country Names/Codes in R

We have run into this issue before: you have $n \geq 2$ datasets with $1 < k \leq n$ different coding schemes for the cross-sectional unit. You need to get them all standardized so you can merge the data and ~~increase the measurement error~~ ~~control for a reviewer’s favorite variable~~ run some models.

Last week I was about to spend some time merging alphanumeric ISO codes with their COW counterparts, when I ran across the new countrycode package in R.* The package uses regular expressions to convert between the following supported formats:

• Correlates of War character
• Correlates of War numeric
• ISO3 character
• ISO3 numeric
• ISO2 character
• IMF numeric
• FIPS 10-4
• FAO numeric
• United Nations numeric
• World Bank character
• official English short country names (ISO)
• continent
• region
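Under the hood, the approach is simple: each country gets a regular expression matching common variants of its name, and conversion is a lookup through those patterns. A toy Python illustration of the idea (the three entries below are made up for the example, not the package’s actual conversion table):

```python
import re

# Toy regex-to-ISO3 dictionary in the spirit of countrycode's approach.
COUNTRY_PATTERNS = {
    r"united states|u\.s\.a\.?": "USA",
    r"myanmar|burma": "MMR",
    r"ivory coast|c[oô]te d.ivoire": "CIV",
}

def to_iso3(name):
    """Return the ISO3 code for the first pattern matching name, else None."""
    for pattern, code in COUNTRY_PATTERNS.items():
        if re.search(pattern, name, flags=re.IGNORECASE):
            return code
    return None

print(to_iso3("Republic of the Union of Myanmar"))  # MMR
```

The real package maintains one such pattern per country and applies it across all the supported coding schemes, which is why it tolerates spelling variants that break exact-match merges.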

The author is Vincent Arel-Bundock, a doctoral student in comparative politics at Michigan. Thanks Vincent!

________________________________________

* New here meaning I didn’t know about it before and its documentation is dated Jan. 20, 2013.

# Automatically Setup New R and LaTeX Projects

You have a finite number of keystrokes in your life. Automating repetitive steps is one of the great benefits of knowing a bit about coding, even if the code is just a simple shell script. That’s why I set up an example project on Github using a Makefile and sample article (inspired by Rob Hyndman). This post describes the structure of that project and explains how to modify it for your own purposes.

Running the Makefile with the example article in an otherwise-empty project directory will create:

• a setup.R file for clearing the workspace, setting paths, and loading libraries
• a data directory for storing files (csv, rda, etc)
• a drafts directory for LaTeX, including a generic starter article
• a graphics directory for storing plots and figures to include in your article
• an rcode directory for your R scripts

It also supplies some starter code for the setup.R file in the main directory and a start.R file in the rcode directory. This takes the current user and sets relative paths to the project directories with simple variable references in R. For example, after running setup.R in your analysis file you can switch to the data directory with setwd(pathData), then create a plot and save it after running setwd(pathGraphics). Because of the way the setup.R file works, you could have multiple users working on the same project and not need to change any of the other scripts.

If you want to change this structure, there are two main ways to do it. You can add more (or fewer) directories by modifying the mkdir lines in the Makefile. You can add (or remove) text in the R files by changing the echo lines.
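If you’d rather not depend on make, the same scaffolding is easy to script in any language. Here is a minimal Python sketch; the directory names match the post, but the setup.R contents are illustrative rather than the actual starter code from the repository:

```python
import os

DIRS = ["data", "drafts", "graphics", "rcode"]

# Illustrative stand-in for the generated setup.R (not the repo's real text).
SETUP_R = """# setup.R: shared paths for all project scripts
pathMain     <- getwd()
pathData     <- file.path(pathMain, "data")
pathGraphics <- file.path(pathMain, "graphics")
"""

def scaffold(root):
    """Create the project directories and a starter setup.R under root."""
    for d in DIRS:
        os.makedirs(os.path.join(root, d), exist_ok=True)
    with open(os.path.join(root, "setup.R"), "w") as f:
        f.write(SETUP_R)

scaffold("example-project")
```

Because every script derives its paths from setup.R rather than hard-coding them, the same trick works here: collaborators can clone the project anywhere and nothing else needs to change.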

If you decide to make major customizations to this file or already have your own structure for projects like this, leave a link in the comments for other readers.

H/T to Josh Cutler for encouraging me to write a post about my project structure and automation efforts. As a final note, this is the second New Year’s Eve in a row that I have posted a tech tip. Maybe it will become a YSPR tradition!

Update: If you’re interested in your own automated project setup, check out ProjectTemplate by John Myles White. Thanks to Trey Causey on Twitter and Zachary Jones in the comments for sharing this.

# Afghanistan Casualties Over Time and Space

The data comes from the Defense Casualty Analysis System for Operation Enduring Freedom. Here it is over time:

Notice the seasonality of deaths in Afghanistan, likely due to the harsh winters. Here is the same data plotted across space (service members’ hometowns):

Not surprisingly, hometowns of OEF casualties are similar to those of service members killed in Iraq.

# Simulating the NLDS: Can the Giants Win?

In his new book, Think Bayes, Allen Downey relates the “Boston Bruins” problem: estimate the Bruins’ probability of winning the 2010–2011 NHL championship after two wins and two losses. I will briefly describe Downey’s approach, and then relate it to the current situation of the San Francisco Giants.

One (naive) approach would be to model this as a gambler’s ruin problem. That model has two shortcomings here: the total number of games to be played is uncertain (the championship is a best-of-n series rather than play until one side is totally defeated), and it throws away important information about the scores of the games.

Instead, we model baseball as a Poisson process, in which a run is equally likely to be scored at any time during the game. This is still an oversimplification (the odds are better with runners on base, for example), but it gets us closer to the “true” model. We further assume that games between the Reds and Giants in this year’s National League Division Series are similar enough that each team’s score can be treated as a draw from a Poisson distribution whose rate parameter λ is constant between games. Different pitchers could throw this assumption off (no pun intended), but we will again use it as a not-entirely-implausible simplification.

Having made our assumptions, we now use a four-step process proposed by Downey:

1. Use statistics from previous games to choose a prior distribution for λ.
2. Use the score from the first four games to estimate λ for each team.
3. Use the posterior distributions of λ to compute the distribution of runs for each team, the distribution of the run differential, and the probability that each team wins.
4. Simulate the rest of the series to estimate the probability of each possible outcome.

To calculate λ, we will use the team batting stats from ESPN and the thinkbayes Python package from Downey’s site.
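Stripped of the thinkbayes machinery, steps 2 and 4 can be sketched directly: take each team’s average runs per game as a point estimate of λ (a cruder shortcut than the full posterior used in the actual analysis) and simulate the deciding game. The scores below are placeholders, not the actual 2012 NLDS results:

```python
import math
import random

random.seed(42)

# Placeholder runs scored in the first four games (NOT the real box scores).
giants_scores = [2, 0, 2, 8]
reds_scores = [5, 9, 1, 3]

# Point estimates of each team's Poisson scoring rate (runs per game).
lam_giants = sum(giants_scores) / len(giants_scores)
lam_reds = sum(reds_scores) / len(reds_scores)

def poisson(lam):
    """Draw one Poisson variate via Knuth's method (fine for small lam)."""
    L, k, p = math.exp(-lam), 0, 1.0
    while p > L:
        k += 1
        p *= random.random()
    return k - 1

def giants_win_game(lam_a, lam_b):
    """Simulate one game; re-draw ties to mimic extra innings."""
    while True:
        a, b = poisson(lam_a), poisson(lam_b)
        if a != b:
            return a > b

n = 10000
wins = sum(giants_win_game(lam_giants, lam_reds) for _ in range(n))
print(wins / n)  # Monte Carlo estimate of P(Giants take game 5)
```

Replacing the point estimate of λ with draws from its posterior, as in Downey’s four-step recipe, would propagate the parameter uncertainty into the win probability instead of ignoring it.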

Here is the distribution of λ, using regular season scoring as the prior and updating with the results of the first four games of the division series:

And here are the predicted runs-per-game by team, using simulations:

According to the model’s predictions, the probability that the Giants win today’s game (and the division series) is 0.387. I would have preferred to use a Gamma prior for λ and run some more simulations in R, but I wanted to use Downey’s example and get this up before the game started… which was a few minutes ago (although as I post, the score is still 0-0). Either way, enjoy the game!