RubyMotion for Complete Beginners

According to the RubyMotion guide for getting started, “RubyMotion is a toolchain that permits the development of iOS applications using the Ruby programming language.” In less formal terms, it lets you write iOS apps in Ruby using your favorite development environment rather than Apple’s unpopular Xcode IDE. This post assumes you have gone through the guide above but do not have much other iOS development experience.

My number one recommendation for anyone coming to RubyMotion for the first time would be RubyMotion: iOS Development with Ruby by Clay Allsopp. This book has the quality we have come to expect from the Pragmatic Programmers. The code examples are clear and well-documented, encouraging you to work hands-on with RubyMotion from the first chapter.  The book website over at PragProg also includes a discussion forum where the author personally answers questions.

The Pragmatic Programmers also put out a RubyMotion screencast. Screencasts are popular within the Ruby-Rails community and seem to already be fairly widespread within the RubyMotion world.

My favorite RubyMotion screencast to date is Jamon Holmgren’s tutorial for making an app that displays several YouTube videos. In all, I am pretty sure I wrote just over 100 lines of code for this demo app. I am certain that in Objective-C the app would have required much more code and would have been less fun to write. This tutorial uses ProMotion, a tool Jamon also wrote that helps organize your app’s behavior around “screens.” You can find a tutorial for getting started with ProMotion here.

If you would like to know a bit more about the background of RubyMotion and the people who use it, there are two podcast episodes I would recommend. The first is the Ruby Rogues interview with Laurent Sansonetti, the creator of RubyMotion. The second is Episode 29 of the Giant Robots Smashing into Other Giant Robots podcast (“The Most Ironic iOS Developer”). In that episode Ben Orenstein interviews two thoughtbot developers, one of whom primarily uses RubyMotion while the other uses Objective-C almost exclusively but has a bit of experience with RubyMotion. They give some nice perspective on the pros and cons of RubyMotion for iOS development, and the show notes provide a number of other resources.

To keep up with new resources in the RubyMotion community, check out RubyMotion Tutorials. The code that I have written during this initial learning process is on GitHub.

Update 1: I forgot to mention that RubyMotion also offers a generous discount for an educational license if you do not plan to sell your apps on the App Store.

Update 2: Jamon Holmgren tweeted a new version of his tutorial this morning (1/30/13).

Automatically Set Up New R and LaTeX Projects

You have a finite number of keystrokes in your life. Automating repetitive steps is one of the great benefits of knowing a bit about coding, even if the code is just a simple shell script. That’s why I set up an example project on GitHub using a Makefile and a sample article (inspired by Rob Hyndman). This post explains the structure of that project and how to modify it for your own purposes.

Running the Makefile with the example article in an otherwise-empty project directory will create:

  • a setup.R file for clearing the workspace, setting paths, and loading libraries
  • a data directory for storing files (csv, rda, etc)
  • a drafts directory for LaTeX, including a generic starter article
  • a graphics directory for storing plots and figures to include in your article
  • an rcode directory for your R scripts

It also supplies some starter code for the setup.R file in the main directory and a start.R file in the rcode directory. setup.R detects the current user and sets paths to the project directories as simple variables in R. For example, after sourcing setup.R in your analysis file, you can switch to the data directory with setwd(pathData), or create a plot and save it after running setwd(pathGraphics). Because of the way setup.R works, multiple users can work on the same project without changing any of the other scripts.
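
To make this concrete, here is a minimal sketch of the kind of setup.R this produces. The user names and paths are placeholders, and the exact variable names (beyond pathData and pathGraphics, mentioned above) are assumptions rather than the actual generated file:

# setup.R -- a minimal sketch; the real generated file may differ
rm(list = ls())  # clear the workspace

# set the project root based on the current user, so several people can
# share the other scripts unchanged ("alice" and "bob" are placeholders)
user <- Sys.info()[["user"]]
if (user == "alice") {
  pathMain <- "/Users/alice/projects/example"
} else if (user == "bob") {
  pathMain <- "/Users/bob/research/example"
} else {
  pathMain <- getwd()
}

# paths to the project directories, referenced from the analysis scripts
pathData     <- file.path(pathMain, "data")
pathDrafts   <- file.path(pathMain, "drafts")
pathGraphics <- file.path(pathMain, "graphics")
pathRcode    <- file.path(pathMain, "rcode")

# library() calls shared across the project would go here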

If you want to change this structure, there are two main ways to do it. You can add more (or fewer) directories by modifying the mkdir lines in the Makefile. You can add (or remove) text in the R files by changing the echo lines.

If you decide to make major customizations to this file or already have your own structure for projects like this, leave a link in the comments for other readers.

H/T to Josh Cutler for encouraging me to write a post about my project structure and automation efforts. As a final note, this is the second New Year’s Eve in a row that I have posted a tech tip. Maybe it will become a YSPR tradition!

Update: If you’re interested in your own automated project setup, check out ProjectTemplate by John Myles White. Thanks to Trey Causey on Twitter and Zachary Jones in the comments for sharing this.

How to Get FIPS Codes from Latitude and Longitude

FIPS codes are unique identifiers for geographic units within the US. Although they have technically been withdrawn as a standard, they are still widely used in political science and other applications for geographic categorization of data. For example, the CBS/New York Times monthly polling series includes the FIPS code for the county in which each respondent lives.

Say you have some other data with latitude and longitude indicators that you would like to combine with FIPS-coded data. I have written a short Ruby script, below, that will do exactly this. It assumes that your data are in .csv format, which is generic enough that you can usually convert to it from whatever format your data are currently stored in. You will also need the Ruby geokit gem:

gem install geokit

Once you have the data ready and the gem installed, you are good to go. Just fill out the lines with comments and run the following from IRB (or however you like to run your Ruby scripts):

require 'geokit'
require 'csv'

filename = # csv file
fipslist = []

CSV.foreach(filename) do |row|
  lat = # latitude column
  long = # longitude column
  ll = Geokit::LatLng.new(lat, long)
  fcc = Geokit::Geocoders::FCCGeocoder.reverse_geocode(ll)
  puts fcc.district_fips
  fipslist << fcc.district_fips
end

You can then do anything you want to with the fipslist object, including writing it out to a file. If you want to share improvements or have questions, please use the comments section below.

Merging Arthur Banks’ Time Series with COW

Recently I needed to combine data from two of the most widely used (at least in my subfield) cross-national time-series data sets: Arthur Banks’ time series and the Correlates of War Project (COW). Given how often these data sets are used, I was a bit surprised that I could not find a record of anyone else combining them. The closest attempt I could find was Andreas Beger’s Stata do-file for converting country names to COW codes.

Beger’s file has all of the country names in lowercase, so I used Ruby’s upcase method to change that. That change took care of 10,396 of the 14,523 observations, or just under 72 percent. Next, I had to deal with the fact that a number of the countries in Arthur Banks’ data no longer exist (they have names like Campuchea, Ceylon, and Ciskei; see here and here). That matching is handled in the main file. After that, the merged data were ready to use in Stata as desired.

I am not going to put the full combined data up because the people in control of Arthur Banks’ time series are really possessive. But if you already have both data sets, combining them should be much easier using these scripts.

Getting Started with Prediction

From historians to financial analysts, researchers of all stripes are interested in prediction. Prediction asks the question, “given what I know so far, what do I expect will come next?” In the current political season, presidential election forecasts abound, a tradition that dates back to the work of Ray Fair, whose book is ridiculously cheap on Amazon. In today’s post, I will give an example of a much more basic (and hopefully more relatable) question: given the height of a father, how do we predict the height of his son?

To see how common predictions about children’s traits are, just Google “predict child appearance” and you will be treated to a plethora of websites and iPhone apps with photo uploads. Today’s example is more basic and follows three questions that we should ask ourselves before making any prediction:

1. How different is the predictor from its baseline?
It’s not enough to have a single bit of information from which to predict; we also need to know something about the baseline of the quantity we are interested in (often its average value) and how different our predictor is from that baseline. The predictor in this case is the height of the father, which we will call U. The outcome is the height of the son, which we will call V.

To keep this example simple, let us assume that U and V are normally distributed; in other words, their distributions look like the familiar “bell curve” when plotted. To see how different our given observations of U or V are from their baselines, we “standardize” them into X and Y:

X = {{u - \mu_u} \over \sigma_u }

Y = {{v - \mu_v} \over \sigma_v },

where \mu is the mean and \sigma is the standard deviation. In our example, let \mu_u = 69, \mu_v=70, and \sigma_v = \sigma_u = 2.

2. How much variance in the outcome does the predictor explain?
In a simple one-predictor, one-outcome (“bivariate”) example like this, we can answer question #2 by knowing the correlation between X and Y, which we will call \rho (and which equals the correlation between U and V in this case). For simplicity’s sake, let’s assume \rho={1 \over 2}. In real life we would probably estimate \rho using regression, which is really just the reverse of prediction. We should also keep in mind that correlation only captures the linear relationship between X and Y, but that is not something to worry about in this example. Using \rho, we can set up the following prediction model for Y:

Y= \rho X + \sqrt{1-\rho^2} Z.

Plugging in the values above we get:

Y= {1 \over 2} X + \sqrt{3 \over 4} Z.

Z is explained in the next paragraph.

3. What margin of error will we accept?
No matter what we are predicting, we have to accept that our estimates are imperfect. We hope that we are correct on average, but that just means that our over- and under-estimates cancel out. In the equation above, Z represents our errors. For our prediction to be unbiased, there has to be zero correlation between X and Z. You might think that is unrealistic, and you are probably right, even for our simple example. (In fact, you can build a good career by pestering other researchers with this question every chance you get.) But just go with me for now. The level of error that we are willing to accept determines the “confidence interval.” We will ignore confidence intervals in this post, focusing instead on point estimates while recognizing that our predictions are unlikely to be exactly correct.

The Prediction

Now that we have set up our prediction model and nailed down all of our assumptions, we are ready to make a prediction. Let’s predict the height of the son of a man who is 72″ tall. In probability notation, we want

\mathbb{E}(V|U=72),

which is the expected son’s height given a father with a height of 72”.

Following the steps above we first need to know how different 72″ is from the average height of fathers.  Looking at the standardizations above, we get

X = {U-69 \over 2}, and

Y = {V - 70 \over 2}, so

\mathbb{E}(V|U=72) = \mathbb{E}(2Y+70|X=1.5) = \mathbb{E}(2({1 \over 2}X + \sqrt{3 \over 4}Z)+70|X=1.5),

which reduces to 1.5 + 2\sqrt{3 \over 4}\,\mathbb{E}(Z|X=1.5) + 70. As long as we were correct earlier that Z does not depend on X and has an average of zero, the middle term drops out and we get a predicted son’s height of 71.5 inches, slightly shorter than his dad but still above average.
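
As a quick sanity check, here is a short R snippet that plugs in the numbers from this example (the variable names are just for illustration):

# parameters assumed in the example above
mu_u  <- 69    # mean height of fathers (inches)
mu_v  <- 70    # mean height of sons (inches)
sigma <- 2     # common standard deviation
rho   <- 0.5   # correlation between father and son heights

u <- 72                    # the father's height
x <- (u - mu_u) / sigma    # standardized predictor: 1.5
ey <- rho * x              # expected standardized son's height, since E(Z) = 0
mu_v + sigma * ey          # 71.5, the predicted son's height in inches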

This phenomenon of the outcome (the son’s height) being closer to the average than the predictor (the father’s height) is known as regression to the mean, and it is the source of the term “regression” that is so widely used in statistical analysis today. It dates back to one of the earliest large-scale statistical studies, Sir Francis Galton’s 1886 paper “Regression towards Mediocrity in Hereditary Stature” (pdf), which fits perfectly with today’s example.
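
If you would like to see regression to the mean in action rather than just in algebra, a small simulation under the same assumptions (this code is illustrative, not part of the original derivation) makes the point:

# simulate father-son pairs under the assumptions above
set.seed(42)
n <- 100000
x <- rnorm(n)                    # standardized father heights
z <- rnorm(n)                    # errors, independent of x
y <- 0.5 * x + sqrt(0.75) * z    # standardized son heights
u <- 69 + 2 * x                  # fathers, in inches
v <- 70 + 2 * y                  # sons, in inches

# average height of sons whose fathers are about 72 inches tall:
# roughly 71.5, closer to the mean of 70 than the fathers are to 69
mean(v[abs(u - 72) < 0.25])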

Further reading: If you are already comfortable with the basics of prediction, and know a bit of Ruby or Python, check out Prior Knowledge.

How to Count Words in LaTeX Documents

One thing that can be hard for new LaTeX users to adjust to is the lack of an easy word count of the kind other writing programs provide. Now there is a solution, thanks to Matthias Orlowski and Alex Iliopoulos. I share Matthias’s instructions here, with a couple of notes of my own at the end.

1. Check whether Perl is installed by typing “which perl” in Terminal. That should be the case since Mac OS X ships with an installation of Perl.

2. Check whether texcount is installed by typing “which texcount” in Terminal. That should also be the case if you installed MacTeX recently. If not, install it via TeX Live.

3. Copy the code below to your Preferences.el file, which should be located in ~/Library/Preferences/Aquamacs Emacs

; count words in LaTeX documents by running texcount on the current buffer's file
(defun latex-word-count ()
  (interactive)
  (shell-command (concat "PATH " ; replace PATH with the path returned by `which texcount'
    "-inc " ; texcount option (set to count documents included via \input)
    (buffer-file-name))))

; bind the command to the F6 key
(global-set-key (quote [f6]) 'latex-word-count)

“PATH ” is the path returned by ‘which texcount’ in (2), and the space at the end is important.

4. Restart Aquamacs and open a Tex file. Hit F6 and see the magic happen!

Now, obviously these instructions are only for OS X, but I would be happy to share Windows instructions if someone wants to adapt them. The first thing to look out for is that on recent OS X MacBooks the default behavior of F6 is to brighten the keyboard backlight. You can change this by going to the Keyboard pane of System Preferences and selecting “Use all F1, F2, etc. keys as standard function keys.” The second potential problem is if the path of the TeX file you are using includes spaces; this can probably be addressed by escaping each space with a backslash (\), but I cannot verify that from experience.

If you run into other issues with this script, or have new adaptations of it, feel free to leave them in the comments.

Find Unique Values in List (Python)

I Googled this problem today and found a lot of people creating long-ish functions to find unique values. There is a much simpler solution: turn your list into a set (which removes duplicates) and then back into a list.

newList = list(set(oldList))

Hope this is of help to someone. Apologies if this is already a well-known solution; the set type has been a built-in since Python 2.4, so it is not exactly new.

Update: This post generates a fair amount of traffic. Peter Bengtsson has some other options if you care more about speed/ordering/etc rather than simplicity, which was the goal here.

Finding a Series of Confidence Intervals in R

[Note: While many of my posts appeal only to readers with certain interests (specifically, mine), this one is meant to provide a public good in the form of an R script that can be run to find multiple confidence intervals around the same sample value. This and other methodological posts in the future may not appeal to the general reader, so they are posted here under the category “Technical.” Read only the “Uncategorized” posts if you prefer my random miscellany. Comments from readers of all methodological traditions and experience levels are invited.]

Purpose: Find a series of lower and higher bounds for the confidence interval around a sample statistic.

Script:

###################################################
# Computing a Series of Confidence Intervals in R
# Matt Dickenson
# yspr dot wordpress dot com
# Released under a Creative Commons Licence
###################################################

# INSTRUCTIONS
# First, you will need your desired confidence levels, sample statistic, and standard error of the sample statistic
# Compute those, and then run this script
CONFIDENCE <- function(x, y, se){
  intervals <- matrix(NA, nrow=length(x), ncol=2)
  levels <- matrix(rep(x, 2), nrow=length(x), ncol=2)

  for (i in 1:(length(x)*2)){
    if (i %% 2 != 0){
      # odd i: lower bound for the confidence level in row floor(i/2)+1
      zi <- qnorm((1 - levels[(floor(i/2)+1), 1])/2)
      low <- y + (zi*se)
      intervals[(floor(i/2)+1), 1] <- low
    } else {
      # even i: higher bound for the confidence level in row i/2
      zi <- qnorm((1 - levels[(i/2), 2])/2)
      high <- y - (zi*se)
      intervals[(i/2), 2] <- high
    }
  }
  row.names(intervals) <- x
  colnames(intervals) <- c("Lower Bound", "Higher Bound")
  intervals
}
# Now, run the command as
CONFIDENCE(x,y,se)
# CONDITIONS:
# where x is the vector containing your desired confidence levels (0<x[i]<1 for all i)
# and y is your sample statistic and se is the standard error of your sample statistic
# note that the actual variable names can be anything you want, as long as they are entered in this order

Notice that in fewer than twenty lines of actual code we have developed a function that can find any number of normal-distribution confidence intervals. It is also flexible: with just a bit of tweaking you could change it to use another distribution, like Student’s t, binomial, or Poisson (a sketch of a Student’s t variant appears at the end of this post). Here is an example of one use for this function:

# EXAMPLE
# Loading data on subprime mortgages from Coral Gables-Fort Myers Florida
load("subprime.RData")
head(subprime)
# Finding the proportion of mortgage holders who had subprime mortgages
ptrue <- mean(subprime$high.rate, na.rm=T)

# Drawing a simple random sample of the population
set.seed(126)
# (if subprime.sample is not already defined, draw your simple random sample
#  from subprime$high.rate here before computing phat)
phat <- mean(subprime.sample, na.rm=T)
phat
# =0.22
# computing standard error of sample mean
sampleSE <- sqrt((phat*(1-phat)/length(subprime.sample)))

# Say we want to find the 50, 95, and 99 percent confidence interval estimates...
conlev <- c(.50,.95,.99)
CONFIDENCE(conlev, phat, sampleSE)
     Lower Bound Higher Bound
0.5    0.2023289    0.2376711
0.95   0.1686504    0.2713496
0.99   0.1525152    0.2874848
# What about including the ninety percent confidence intervals?
conlev2 <-c(.50, .90, .95, .99)
CONFIDENCE(conlev2, phat, sampleSE)
     Lower Bound Higher Bound
0.5    0.2023289    0.2376711
0.9    0.1769061    0.2630939
0.95   0.1686504    0.2713496
0.99   0.1525152    0.2874848
# How about an interval for each percentile?
CONFIDENCE(seq(0,1,by=.01), phat, sampleSE)
# output not shown for the sake of space, BUT
# We can graph these:
x <- seq(0,1,by=.01)
y <- CONFIDENCE(x,phat,sampleSE)
library(grDevices)
plot(x,y[,1], type="l", ylim=c(0,0.4), xlim=c(0,1), ylab="Confidence Interval", xlab=expression(alpha))
polygon(c(x[1],x[2:98],x[99],x[99],x[98:2],x[1]), c(y[1,2],y[2:98,2],y[99,2],y[99,1],y[98:2,1],y[1,1]), border="grey",col="grey")
lines(x,y[,1], type="l")
lines(x,y[,2], type="l")

Here is the plot:

[Plot: confidence interval bounds, with the interval shown as a grey region widening as the confidence level increases]

The width of the confidence interval (the grey area) increases as we seek more confidence about our estimate. Since we set the seed to 126, your random sample should be exactly the same as mine, which allows for direct comparison of results for the example.

# How does this compare to the true population proportion?
points(x,y=(rep(ptrue, length(x))),type="l", col="red")

Here is the new plot:

[Plot: the same graph with the true population proportion added as a red horizontal line]
We can see that the interval includes the true population proportion by the time our confidence level grows to cover 60 percent of the distribution of our sample proportion.
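
Finally, as an example of the kind of tweak mentioned earlier, here is a minimal sketch of a Student’s t version of the function. It returns the same two columns but takes the degrees of freedom as an extra argument; this is just one way to do it, not part of the original script:

# a Student's t variant of CONFIDENCE: same idea, but with t critical values
CONFIDENCE.T <- function(x, y, se, df){
  zi <- qt((1 - x)/2, df = df)    # negative t quantile for each confidence level
  intervals <- cbind(y + (zi*se), y - (zi*se))
  row.names(intervals) <- x
  colnames(intervals) <- c("Lower Bound", "Higher Bound")
  intervals
}

# for example, with 30 degrees of freedom:
# CONFIDENCE.T(conlev, phat, sampleSE, df=30)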

Feedback is welcome in the comments, especially links to scripts that modify this function.