A Checklist for Using Open Source Software in Production

A great majority of the web is built on open source software. Approximately two-thirds of public servers on the internet run a *nix operating system, and over half of those are Linux. The most popular server-side programming languages also tend to be open source (including my favorite, Ruby). This post is about adding a new open source library to an existing code base. What questions should you ask before adding such a dependency to a production application?

The first set of questions are the most basic. A “no” to any of these should prompt you to look elsewhere.
  • Is the project written in a language you support? If not, is it compatible (e.g. through stdin/stdout or by compiling to your language of choice)?
  • Is the project in a version of the language you support? If it’s written in Python 3 and you only support Python 2, for example, using this library could lead to headaches.
  • Can you use the project in your framework of choice (e.g. Rails or Django)?
  • Are there conflicts with other libraries or packages you’re currently using? (This is probably the hardest question to answer, and you might not know until you try it.)
Assuming there are no immediate technical barriers, the next questions to ask are of the legal variety. Open source licenses come in many flavors. In the absence of a license, traditional copyright rules apply. Be especially careful if the project you are investigating uses the GPL–even basing your own code on a GPL-licensed project can have serious legal ramifications. There’s a great guide to OSS licenses on GitHub. If you’re the author or maintainer of an open source project, check out choosealicense.com.
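For a Ruby project, one way to get a first look at the licensing question is to ask RubyGems what each installed gem declares. The sketch below is only a starting point: it assumes the gems you care about are installed locally, it relies on the self-reported licenses field of each gemspec, and it is no substitute for reading the actual license text.

```ruby
# Group installed gems by the license they declare in their gemspec.
by_license = Hash.new { |hash, key| hash[key] = [] }

Gem::Specification.each do |spec|
  # A gem may declare several licenses, or none at all.
  licenses = spec.licenses.empty? ? ['(none declared)'] : spec.licenses
  licenses.each { |license| by_license[license] << spec.name }
end

by_license.sort.each do |license, gems|
  puts "#{license}: #{gems.uniq.sort.join(', ')}"
end
```

Anything that lands under "(none declared)" deserves a closer look before it goes into production.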
The next thing to consider is whether and how the project is tested. If there is not an automated test suite, consider starting one as your first contribution to the project and be very reluctant to add the project to your application. Other related questions include:
  • Are there unit tests?
  • Are there integration tests?
  • What is the test coverage like?
  • Do the tests run quickly?
  • Are the tests clearly written?
Finally, by using an open source project you are also joining a community of developers. None of these questions are necessarily show-stoppers but knowing the size of the community and the tone of its discourse can save you pain down the road.
  • Is the project actively maintained? When was the last commit?
  • Does the community have a civil, professional style of debate and discussion?
  • Is there only one developer/maintainer who knows everything? This doesn’t have to be a deal breaker. However, if there is a single gatekeeper you should make sure you understand the basics of the code and could fork the project if necessary.

This is by no means an exhaustive list, but these questions can serve as a useful checklist before adding an open source library as a dependency for your project.

Review: RubyMotion iOS Development Essentials

RubyMotion is a continued topic of interest on this blog, and I will likely have more posts about it in the near future. At this stage I am still getting comfortable with iOS development, but I would much rather be doing it in the friendly playground of Ruby than in the Objective-C world. In addition to the RubyMotion book from PragProg, the next resource I would recommend is RubyMotion iOS Development Essentials.*

The book takes a “zero-to-deploy” approach, which is great for beginners trying to get their first app into the App Store. The first few chapters will be redundant for developers who have worked with RubyMotion before, but they provide a helpful introduction to how RM works and the Model-View-Controller paradigm.

For several chapters the book uses the same example application, a restaurant recommender reminiscent of Yelp. Demonstrating code by building up from a simple application is a nice way of presenting the material. By the time readers have worked through these chapters they will have an example app that is more interesting than many of the toy apps in shorter tutorials.

Later chapters will benefit novice and experienced developers alike, because they fill a gap in the RubyMotion literature. Many tutorials overlook the process of testing RM code, and testing iOS in general can be challenging. The testing chapter of this book goes over unit testing, functional testing, and tests that rely on device events such as gestures.

My favorite chapter in the book was chapter 6, which goes over device capabilities. At 46 pages this is the longest chapter in the book, covering GPS, gestures, Core Data, using the Address Book, and more. I especially enjoyed working through the section on accessing the camera and Photo Library. This is difficult to test on the simulator since there is (obviously) no access to a built-in camera (as with certain iOS devices including some iPod Touch models), but the example app covers how to handle this gracefully.

Stylistically, it can be a challenge to lay out a book that uses iOS API jargon like UIImagePickerControllerSourceTypePhotoLibrary. There were some gripes with the authors’ choice of two-space indenting, but that is my preference so it did not bother me. One addition I would have preferred would be additional formatting for the code, using either colors (for the e-book) or bolding (for the print version) to distinguish function names and help the reader keep their place in the code. The apps themselves rely mainly on iOS defaults. This is common in tutorials, but it also helps them look natural in iOS 7. Most of the time I was working through the book I used the iOS 6.1 simulator, but it was no problem to upgrade to iOS 7.

As a whole this book is a thorough introduction to RubyMotion development. It has several key features that are missing from other RubyMotion tutorials, including an emphasis on testing code. This book makes a great resource for new RubyMotion developers, or developers who want to use more of the device capabilities.

*Note: For this review I relied on the e-book version, which was provided as a review copy by Packt Publishing.

Grad Student Gift Ideas

My sister is starting a graduate program this fall, so I wanted to put together a “grad school survival kit” gift basket for her. When I was looking, though, most search results for things like that were put together by gift basket companies and a large number of them include junk food as filler. While junk food can be a great stress reliever, I would not recommend making that the bulk of your gift to a grad student. Instead, consider some of the following gifts that range from practical to fun:


Ruby’s Benevolent Dictator

The Ruby Logo

The first version of the Ruby programming language was developed by Yukihiro Matsumoto, better known as “Matz,” in 1995. Since then it has become especially popular for web development thanks to the advent of Rails by DHH. A variety of Ruby implementations have also sprung up, optimized for various uses. You may recall our recent discussion of RubyMotion as a way to develop iOS apps in Ruby. As with human languages, the spread and evolution of computer languages raises an interesting question: how different can two things be and still be the same?

To run with the human language example for a bit, consider the following. My native language is American English. (There are a number of regional variants within the US, so even the fact that American English is a useful category is telling.) I would recognize a British citizen with a cockney accent as a speaker of the same language, even though I would have trouble understanding him or her. I would not, however, recognize a French speaker as someone with whom I shared a language. The latter distinction exists despite the relative similarity between the languages–a shared alphabet, shared roots in Latin, and so on. So who decides whether two languages are the same?

In the case of human languages this is very much an emergent decision, worked out through the behavior of numerous individuals with little conscious thought for their coordination. This is where the human/computer language analogy fails us. The differences between computer languages are discrete, not continuous–there are measurable differences and similarities between any two language implementations, and intermediate steps between one implementation and another might not be viable. So who decides what is Ruby and what is not?

That is the question Brian Shirai raised in a series of posts and a conference talk. As of right now there is no clear process by which the community decides the future of Ruby, or what counts as a legitimate Ruby implementation. Matz is a benevolent dictator–but maybe not for life. His implementation is known to some as MRI–“Matz’s Ruby Implementation,” with the implication that this is just one of many.

Shirai is proposing a process by which the Ruby community could depersonalize such decisions by moving to a decision-making council. This depersonalization of power relations is at the heart of what it means to institutionalize. Shirai’s process consists of seven steps:

  1. Ruby Design Council made up of representatives from any significant Ruby implementation, where significant means able to run a base level of RubySpec (which is to be determined).
  2. A proposal for a Ruby change can be submitted by any member of the Ruby Design Council. If a member of the larger Ruby community wishes to submit a proposal, they must work with a member of the Council.
  3. The proposal must meet the following criteria:
    1. An explanation, written in English, of the change, what use cases or problems motivates the change, how existing libraries, frameworks, or applications may be affected.
    2. Complete documentation, written in English, describing all relevant aspects of the change, including documentation for any specific methods whose behavior changes or behavior of new methods that are added.
    3. RubySpecs that completely describe the behavior of the change.
  4. When the Council is presented with a proposal that meets the above criteria, any member can decide that the proposal fails to make a case that justifies the effort to implement the feature. Such veto must explain in depth why the proposed change is unsuitable for Ruby. The member submitting the proposal can address the deficiencies and resubmit.
  5. If a proposal is accepted for consideration, all Council members must implement the feature so that it passes the RubySpecs provided.
  6. Once all Council members have implemented the feature, the feature can be discussed in concrete terms. Any implementation, platform, or performance concerns can be addressed. Negative or positive impact on existing libraries, frameworks or applications can be clearly and precisely evaluated.
  7. Finally, a vote on the proposed change is taken. Each implementation gets one vote. Only changes that receive approval from all Council members become the definition of Ruby.

Step 3B is a particularly interesting one for students of politics. As you may have guessed, Matz is Japanese. (This is somewhat ironic since Ruby is currently the most readable language for English speakers–see this example if you don’t believe me.) Many discussions about Ruby take place on Japanese message boards, and some non-Japanese developers have even learned Japanese so that they can participate in these discussions. English is the lingua franca of the international software development community, so Shirai’s proposal makes sense, but it is not uncontroversial.

In Shirai’s own words this proposal would provide the Ruby community with a “technology for change.” That is exactly what political institutions are for–organizing the decision-making capacity of a community. This proposal and its eventual acceptance, rejection, or modification by the Ruby community will be interesting for students of politics to keep an eye on, and may be the topic of future posts.

RubyMotion for Complete Beginners

According to the RubyMotion guide for getting started, “RubyMotion is a toolchain that permits the development of iOS applications using the Ruby programming language.” In less formal terms it lets you write iOS apps in Ruby using your favorite development environment rather than Apple’s unpopular Xcode IDE. This post assumes you have gone through the guide above but don’t have much other iOS development experience.

My number one recommendation for anyone coming to RubyMotion for the first time would be RubyMotion: iOS Development with Ruby by Clay Allsopp. This book has the quality we have come to expect from the Pragmatic Programmers. The code examples are clear and well-documented, encouraging you to work hands-on with RubyMotion from the first chapter. The book website over at PragProg also includes a discussion forum where the author personally answers questions.

The Pragmatic Programmers also put out a RubyMotion screencast. Screencasts are popular within the Ruby-Rails community and seem to already be fairly widespread within the RubyMotion world.

My favorite RubyMotion screencast to date is Jamon Holmgren’s tutorial for making an app that displays several YouTube videos. In all I am pretty sure I wrote just over 100 lines of code for this demo app. I am certain that in Objective-C the app would have required much more code, and have been less fun to write. This tutorial uses the ProMotion tool that Jamon also wrote, which helps organize your app behavior based on “screens.” You can find a tutorial for getting started with ProMotion here.

If you would like to know a bit more about the background of RubyMotion and the people who use it, there are two podcast episodes I would recommend. The first is the Ruby Rogues interview with Laurent Sansonetti, the creator of RubyMotion. The second is Episode 29 of the Giant Robots Smashing into Other Giant Robots podcast (“The Most Ironic iOS Developer”). In that episode Ben Orenstein interviews two thoughtbot developers, one of whom uses primarily RubyMotion and the other uses Objective-C almost exclusively but has a bit of experience with RubyMotion. They give some nice perspective on the pros and cons of RubyMotion for iOS development and the show notes provide a number of other resources.

To keep up with new resources in the RubyMotion community, check out RubyMotion Tutorials. The code that I have written during this initial learning process is on GitHub.

Update 1: I forgot to mention that RubyMotion also offers a generous discount for an educational license if you do not plan to sell your apps on the App Store.

Update 2: Jamon Holmgren tweeted a new version of his tutorial this morning (1/30/13).

How to Get FIPS Codes from Latitude and Longitude

FIPS codes are unique identifiers for geographic units within the US. Although they have technically been withdrawn as a standard, they are still widely used in political science and other applications for geographic categorization of data. For example, the CBS/New York Times monthly polling series includes the FIPS code for the county in which each respondent lives.

Say you have some other data with latitude and longitude indicators that you would like to combine with FIPS-coded data. I have written a short Ruby script below that will do exactly this. It assumes that you have your data in .csv format, since that is a pretty generic format and you can usually convert your data to that if it is currently stored in another form. You will also need the Ruby geokit gem:

gem install geokit

Once you have the data ready and the gem installed, you are good to go. Just fill out the lines with comments and run the following from IRB (or however you like to run your Ruby scripts):

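require 'geokit'
require 'csv'

filename = 'data.csv' # replace with the path to your csv file
fipslist = []

CSV.foreach(filename) do |row|
  lat = row[0].to_f  # replace 0 with the index of your latitude column
  long = row[1].to_f # replace 1 with the index of your longitude column
  ll = Geokit::LatLng.new(lat, long)
  # Reverse geocode through the FCC service to look up the county FIPS code
  fcc = Geokit::Geocoders::FCCGeocoder.reverse_geocode(ll)
  puts fcc.district_fips
  fipslist << fcc.district_fips
end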

You can then do anything you want to with the fipslist object, including writing it out to a file. If you want to share improvements or have questions, please use the comments section below.
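As one example of writing the results out, here is a minimal sketch that saves the codes to a new .csv file. The sample values stand in for a fipslist built by the script above, and the output file name is a placeholder I made up for illustration.

```ruby
require 'csv'
require 'tmpdir'

# Sample values standing in for the fipslist built by the script above.
fipslist = ['37063', '37135']

# Hypothetical output location; point this wherever you keep your data.
outfile = File.join(Dir.tmpdir, 'fips_codes.csv')

# Write a header row followed by one FIPS code per row.
CSV.open(outfile, 'w') do |csv|
  csv << ['fips']
  fipslist.each { |fips| csv << [fips] }
end

# Read the file back to confirm what was written.
written = CSV.read(outfile)
puts written.inspect
```

Keeping the codes as strings rather than integers preserves any leading zeros, which matters for states with low-numbered FIPS codes.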

Wednesday Nerd Fun: Oximetry with Ruby and R

These posts are getting pretty esoteric, which may be a sign that I should put the series on hold for a while. Feedback is welcome. In any event, here’s some midweek entertainment for the coders among you:

A popular and fast way to effectively get the heart rate is pulse oximetry. A pulse oximeter is a device placed on a thin part of a person’s body, often a fingertip or earlobe. Light of different wavelengths (usually red and infrared) is then passed through that part of the body to a photodetector. The oximeter works by measuring the amounts of red and infrared light absorbed by the hemoglobin and oxyhemoglobin in the blood to determine how oxygenated the blood is. Because this absorption happens in pulses as the heart pumps oxygenated blood throughout the body, the heart rate can also be determined.

We are not going to build an oximeter, but in this post we’ll use the same concepts used in oximetry to determine the heart rate. We will record a video as we pass light through our finger for a short duration of time. With each beat of the heart, more or less blood flows through our body, including our finger. The blood flowing through our finger will block different amounts of the light accordingly. If we calculate the light intensity of each frame of the video we captured, we can chart the amount of blood flowing through our finger at different points in time, therefore getting the heart rate.

Continue here.
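To give a flavor of the last step described above, here is a rough sketch that turns a series of per-frame light intensities into a heart rate by counting peaks. It is not the code from the linked post: the intensities are synthetic (a clean sine wave I made up to stand in for a real video), and the frame rate and pulse frequency are assumptions chosen for illustration.

```ruby
# Synthetic brightness values standing in for per-frame light intensity:
# a 30 fps, 10 second clip with a 1.2 Hz (72 beats per minute) pulse.
frame_rate = 30.0
intensities = Array.new(300) do |i|
  100 + 10 * Math.sin(2 * Math::PI * 1.2 * i / frame_rate)
end

# A frame is a peak if it is brighter than the frame before it
# and at least as bright as the frame after it.
peaks = (1...intensities.size - 1).count do |i|
  intensities[i] > intensities[i - 1] && intensities[i] >= intensities[i + 1]
end

# Convert the number of peaks in the clip to beats per minute.
duration_seconds = intensities.size / frame_rate
beats_per_minute = peaks * 60.0 / duration_seconds
puts beats_per_minute.round
```

Real footage is much noisier than this, so a practical version would need to smooth the signal before counting peaks.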

Merging Arthur Banks’ Time Series with COW

Recently I needed to combine data from two of the most widely used (at least in my subfield) cross-national time-series data sets: Arthur Banks’ time series and the Correlates of War Project (COW). Given how often these data sets are used, I was a bit surprised that I could not find a record of someone else combining them. The closest attempt I could find was Andreas Beger’s country names to COW codes do file.

Beger’s file has all of the country names in lower-case, so I used Ruby’s upcase method to change that. That change took care of just over 75 percent of the observations (10,396 of 14,523). Next, I had to deal with the fact that a bunch of the countries in Arthur Banks’ data do not exist any more (they have names like Campuchea, Ceylon, and Ciskei; see here and here). This was done with the main file. After that, the data was all set in Stata as desired.
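The matching logic can be sketched in a few lines of Ruby. This is not the actual script: the lookup table is a tiny hand-picked sample in the style of Beger’s file, the rename table is hypothetical, and the COW codes shown should be checked against the official list before use.

```ruby
# Small illustrative sample of a lower-case name-to-code lookup.
cow_codes = {
  'united states of america' => 2,
  'canada' => 20,
  'sri lanka' => 780
}

# Upcase the names so they match the upper-case names in the Banks data.
lookup = cow_codes.map { |name, code| [name.upcase, code] }.to_h

# Hypothetical renames for countries that now go by another name.
renames = { 'CEYLON' => 'SRI LANKA' }

# Match each Banks name, applying a rename first when one exists.
banks_names = ['CANADA', 'CEYLON', 'CISKEI']
matched = banks_names.map do |name|
  [name, lookup[renames.fetch(name, name)]]
end.to_h

puts matched.inspect
```

Names that come back with no code, like Ciskei here, are the ones that still need to be handled by hand.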

I am not going to put the full combined data up because the people in control of Arthur Banks’ time series are really possessive. But if you already have both data sets, combining them should be much easier using these scripts.

Getting Started with Prediction

From historians to financial analysts, researchers of all stripes are interested in prediction. Prediction asks the question, “given what I know so far, what do I expect will come next?” In the current political season, presidential election forecasts abound. This dates back to the work of Ray Fair, whose book is ridiculously cheap on Amazon. In today’s post, I will give an example of a much more basic–and hopefully, relatable–question: given the height of a father, how do we predict the height of his son?

To see how common predictions about children’s traits are, just Google “predict child appearance” and you will be treated to a plethora of websites and iPhone apps with photo uploads. Today’s example is more basic and will follow three questions that we should ask ourselves for making any prediction:

1. How different is the predictor from its baseline?
It’s not enough to just have a single bit of information from which to predict–we need to know something about the baseline of the information we are interested in (often the average value) and how different the predictor we are using is from that baseline. The “predictor” in this case will refer to the height of the father, which we will call U. The “outcome” in this case will be the height of the son, which we will call V.

To keep this example simple let us assume that U and V are normally distributed–in other words their distributions look like the familiar “bell curve” when they are plotted. To see how different our given observations of U or V are from their baseline, we “standardize” them into X and Y:

X = {{u - \mu_u} \over \sigma_u }

Y = {{v - \mu_v} \over \sigma_v },

where \mu is the mean and \sigma is the standard deviation. In our example, let \mu_u = 69, \mu_v=70, and \sigma_v = \sigma_u = 2.

2. How much variance in the outcome does the predictor explain?
In a simple one-predictor, one-outcome (“bivariate”) example like this, we can answer question #2 by knowing the correlation between X and Y, which we will call \rho (and which is equal to the correlation between U and V in this case). For simplicity’s sake let’s assume \rho={1 \over 2}. In real life we would probably estimate \rho using regression, which is really just the reverse of predicting. We should also keep in mind that correlation is only useful for describing the linear relationship between X and Y, but that’s not something to worry about in this example. Using \rho, we can set up the following prediction model for Y:

Y= \rho X + \sqrt{1-\rho^2} Z.

Plugging in the values above we get:

Y= {1 \over 2} X + \sqrt{3 \over 4} Z.

Z is explained in the next paragraph.

3. What margin of error will we accept?
No matter what we are predicting, we have to accept that our estimates are imperfect. We hope that on average we are correct, but that just means that all of our over- and under-estimates cancel out. In the above equation, Z represents our errors. For our prediction to be unbiased there has to be zero correlation between X and Z. You might think that is unrealistic and you are probably right, even for our simple example. In fact, you can build a decent career by pestering other researchers with this question every chance you get. But just go with me for now. The level of incorrect prediction that we are able to accept affects the “confidence interval.” We will ignore confidence intervals in this post, focusing instead on point estimates but recognizing that our predictions are unlikely to be exactly correct.
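If you would like to convince yourself that the model behaves as advertised, a quick simulation helps. The sketch below assumes X and Z are independent standard normal draws (generated here with the Box-Muller transform), builds Y from the model with \rho = 0.5, and checks the sample correlations. The seed and sample size are arbitrary choices of mine.

```ruby
rho = 0.5
rng = Random.new(42) # fixed seed so the run is repeatable

# Draw one standard normal value with the Box-Muller transform.
def standard_normal(rng)
  Math.sqrt(-2.0 * Math.log(1.0 - rng.rand)) * Math.cos(2.0 * Math::PI * rng.rand)
end

# Sample correlation between two equal-length arrays.
def correlation(a, b)
  mean_a = a.sum / a.size
  mean_b = b.sum / b.size
  cov = a.zip(b).sum { |p, q| (p - mean_a) * (q - mean_b) }
  var_a = a.sum { |p| (p - mean_a)**2 }
  var_b = b.sum { |q| (q - mean_b)**2 }
  cov / Math.sqrt(var_a * var_b)
end

n = 50_000
xs = Array.new(n) { standard_normal(rng) }
zs = Array.new(n) { standard_normal(rng) }

# Y = rho * X + sqrt(1 - rho^2) * Z, as in the prediction model.
ys = xs.zip(zs).map { |x, z| rho * x + Math.sqrt(1 - rho**2) * z }

corr_xy = correlation(xs, ys)
corr_xz = correlation(xs, zs)
puts "corr(X, Y) = #{corr_xy.round(3)}, corr(X, Z) = #{corr_xz.round(3)}"
```

The correlation between X and Y should come out close to 0.5, and the correlation between X and Z close to zero, which is exactly the unbiasedness condition discussed above.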

The Prediction

Now that we have set up our prediction model and nailed down all of our assumptions, we are ready to make a prediction. Let’s predict the height of the son of a man who is 72″ tall. In probability notation, we want

\mathbb{E}(V|U=72),

which is the son’s expected height given a father with a height of 72″.

Following the steps above we first need to know how different 72″ is from the average height of fathers. Looking at the standardizations above, we get

X = {U-69 \over 2}, and

Y = {V - 70 \over 2}, so

\mathbb{E}(V|U=72) = \mathbb{E}(2Y+70|X=1.5) = \mathbb{E}(2({1 \over 2}X + \sqrt{3 \over 4}Z)+70|X=1.5),

which reduces to 1.5 + \sqrt{3}\,\mathbb{E}(Z|X=1.5) + 70. As long as we were correct earlier about Z not depending on X and having an average of zero, the expectation term vanishes and we get a predicted son’s height of 71.5 inches: slightly shorter than his dad, but still above average.

This phenomenon of the outcome (son’s height) being closer to the average than the predictor (father’s height) is known as regression to the mean and it is the source of the term “regression” that is used widely today in statistical analysis. This dates back to one of the earliest large-scale statistical studies by Sir Francis Galton in 1886, entitled, “Regression towards Mediocrity in Hereditary Stature,” (pdf) which fits perfectly with today’s example.

Further reading: If you are already comfortable with the basics of prediction, and know a bit of Ruby or Python, check out Prior Knowledge.

Eliminate File Redundancy with Ruby

Say you have a file with many repeated, unnecessary lines that you want to remove. For safety’s sake, you would rather make an abbreviated copy of the file than replace it. Ruby makes this a cinch. You just iterate over the file, recording every line the computer has already “seen” in a hash (Ruby’s dictionary). If a line is not in the hash, it must be new, so write it to the output file. Here’s the code, designed with .tex files in mind but easily adaptable:

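puts 'Filename?'
filename = gets.chomp
input = File.open(filename + '.tex')
output = File.open(filename + '2.tex', 'w')
seen = {}
input.each do |line|
  if (seen[line])
    next # skip lines that have already been written
  end
  output.write(line) # first time we have seen this line, so keep it
  seen[line] = true
end
input.close
output.close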

Where would this come in handy? Well, the .tex extension probably already gave you a clue that I am reducing redundancy in a \LaTeX file. In particular, I have an R plot generated as a tikz graphic. The R plot includes a rug at the bottom (tick marks indicating data observations)–but the data set includes over 9,000 observations, so many of the lines are drawn right on top of each other. The \LaTeX compiler got peeved at having to draw so many lines, so Ruby helped it out by eliminating the redundancy. One special tweak for using the script above to modify tikz graphics files is to change the line

if (seen[line])

to

if (seen[line]) && !(line.include? 'node') &&  !(line.include? 'scope') && !(line.include? 'path') && !(line.include? 'define')

if your plot has multiple panes (e.g. par(mfrow=c(1,2)) in R) so that Ruby won’t ignore seemingly redundant lines that are actually specifying new panes. The modified line is a little long and messy, but it works, and that was the main goal here. The resulting \LaTeX file compiles easily and more quickly than it did with all those redundant lines, thanks to Ruby.
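If the long condition bothers you, one possible tidier variant is sketched below. It works on an array of lines rather than on files so it is easy to try out, it uses a Set instead of a hash, and the method name and sample lines are made up for illustration. The protected keywords are the ones from the modified condition above.

```ruby
require 'set'

# Return lines with repeats dropped, always keeping lines that contain
# one of the protected words (structural tikz lines).
def dedupe(lines, protected_words = %w[node scope path define])
  seen = Set.new
  lines.select do |line|
    structural = protected_words.any? { |word| line.include?(word) }
    # Set#add? returns nil when the line was already present.
    first_time = !seen.add?(line).nil?
    structural || first_time
  end
end

# Made-up sample lines: the repeated draw command is dropped,
# but the repeated scope line is kept.
sample = [
  "\\begin{scope}\n",
  "\\draw (1,1) -- (1,2);\n",
  "\\draw (1,1) -- (1,2);\n",
  "\\begin{scope}\n"
]
result = dedupe(sample)
puts result
```

To use it on a real file you could pass in File.readlines and write the result back out with File.write.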