Trade Secrets of Methodologists: A Bibliography

We all know what the scientific method looks like in idealized form. But the first dirty secret is that you don’t actually write a paper that way. In fact, many papers are written almost in reverse, starting with the findings and working backward. Over the weekend @Worse_Reviewer shared some papers that help to convey these secrets and make grad students aware of the tacit knowledge already put to good use by their more senior colleagues. I have obtained ungated links to the papers (or similar versions) wherever available, along with two additional articles via Mike Ward.

Recommended Packages for R 3.x

With the recent release of R 3.0 (OS X) and 3.1 (Windows), I found myself in need of a whole host of packages for data analysis. Rather than discover each one I needed in the middle of doing real work, I thought it would be helpful to have a script with a list of essentials to speed up the process. This became even more essential when I also had to install R on a couple of machines in our department’s new offices.

Thankfully my colleague Shahryar Minhas had a similar idea and had already started a script, which I adapted and share here with his permission. The script is also on GitHub, so if you have additions that you find essential on a new R install, feel free to recommend them.

PACKAGES = c("Amelia",
	"apsrtable", 
	"arm",
	"car", 
	"countrycode", 
	"cshapes",
	"doBy",
	"filehash", 
	"foreign", 
	"gdata",
	"ggplot2",
	"gridExtra",
	"gtools",
	"Hmisc", 
	"lme4", # provides lmer(); "lmer" itself is not a CRAN package
	"lmtest",
	"maptools",
	"MASS", 
	"Matrix",
	"mice", 
	"mvtnorm",
	"plm", 
	"plyr",
	"pscl",
	"qcc",
	"RColorBrewer", 
	"reshape", 
	"sandwich", 
	"sbgcop", 
	"scales", 
	"sp",
	"xlsx", 
	"xtable") 
install.packages(PACKAGES)
install.packages('tikzDevice', repos='http://r-forge.r-project.org')

Mapping Literal Place Names

Place names are another one of those micro-institutions. They often carry a linguistic legacy indicating some important discoverer, inhabitant, or conqueror. Changes in place names are significant too. (Would Sinatra’s “New Amsterdam, New Amsterdam” have rolled off the tongue nearly as nicely?) As the names accumulate history and new generations become accustomed to them, however, we often lose the literal sense of their meaning. In an effort to help undo that, the Atlas of True Names “reveals the etymological roots, or original meanings, of the familiar terms on today’s maps of the World, Europe, the British Isles and the United States.”

Here are a couple of examples, and there is much more at the link:

[Image: sample_us_west]

[Image: true-names]

Facelift

As regular readers will have noticed by now, the site got a facelift about a week ago. This was much needed as the old theme had become outdated (literally–it was named “Twenty-Ten”). Some of the formatting of old posts might have been thrown off by the change. Overall, I hope you will find it a pleasant upgrade.

One thing that didn’t fly well early on was the light font I had chosen for the text in posts–in some browsers it rendered as very low-contrast. I changed it a couple of days later and hopefully it’s better now. Thanks to Daniel for commenting about this via email.

Another upcoming change is that I will be less rigorous about the Monday-Wednesday-Friday schedule of posts. Posts will likely be published at those times when they are available, but I will not make a point of posting on all three of those days every week.

The third and final change is that comments will automatically close after seven days. This is mainly because comments after that time are typically spam. If you have a response to a post more than a week after the fact, you are more than welcome to contact me by email or on Twitter. And feedback about any of these changes is welcome!

Project Design as Reproducibility Aid

From the Political Science Replication blog:

When reproducing published work, I’m often annoyed that methods and models are not described in detail. Even if it’s my own project, I sometimes struggle to reconstruct everything after I took a longer break from a project. An article by post-docs Rich FitzJohn and Daniel Falster shows how to set up a project structure that makes your work reproducible.

To get that “mix” into a reproducible format, post-docs Rich FitzJohn and Daniel Falster from Macquarie University in Australia suggest to use the same template for all projects. Their goal is to ensure integrity of their data, portability of the project, and to make it easier to reproduce your own work later. This can work in R, but in any other software as well.

Here’s their gist. My post from late last year suggests a similar structure. PSR and I were both informed about ProjectTemplate based on these posts–check it out here.

Micro-Institutions Everywhere: Defining Death

From the BBC:

In the majority of cases in hospitals, people are pronounced dead only after doctors have examined their heart, lungs and responsiveness, determining there are no longer any heart and breath sounds and no obvious reaction to the outside world….

Many institutions in the US and Australia have adopted two minutes as the minimum observation period, while the UK and Canada recommend five minutes. Germany currently has no guidelines and Italy proposes that physicians wait 20 minutes before declaring death, particularly when organ donation is being considered….

But the criteria used to establish brain death have slight variations across the globe.

In Canada, for example, one doctor is needed to diagnose brain death; in the UK, two doctors are recommended; and in Spain three doctors are required. The number of neurological tests that have to be performed vary too, as does the time the body is observed before death is declared.

George Box, the Accidental Statistician

George Box, renowned statistician, passed away on April 10 of this year at the age of 93. As the title of his recently released memoir suggests, he stumbled into the career that made him famous. During the Second World War, he was assigned to the Chemical Defence Experimental Station, located at Porton Down. From there, as he recounts,

[M]y job was to make biochemical determinations in experiments on small animals. The results I was getting were very variable, and I told Cullumbine that what we needed was a statistician to analyze our data. He said, “Yes, but we can’t get one. What do you know about it?” I told him I had once tried to read a book about it by someone called R.A. Fisher, but I hadn’t understood it. He said, “Well you read the book so you’d better do it.” So I said, “Yes Sir.” (Kindle Locations 750-754).

I found this book useful because so many biographies are written as if the protagonist had his or her life all planned out from the beginning. Autobiographies are a bit more honest on this front, but none as much as Box’s.

This is particularly helpful for grad students, who tend to get advice from a very biased sample: successful academics. From their accounts we can estimate the probability that someone successful took a certain course of action. But without information on those who do not become academics, it’s impossible to obtain the probability of success when adopting that same strategy. Box’s memoir alone can’t entirely undo this, of course, but he does relate stories of many of his grad students who chose positions in industry.
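The gap between those two probabilities is easy to see with a toy calculation. The sketch below uses counts that are entirely hypothetical (invented for illustration, not taken from Box's memoir or any dataset) to show how a strategy can look common among successful academics while making no difference to the chance of success.

```python
# Hypothetical counts, invented purely for illustration.
# Key: (followed_strategy, became_academic) -> number of grad students.
counts = {
    (True, True): 40,
    (True, False): 160,
    (False, True): 10,
    (False, False): 40,
}


def p_strategy_given_success(c):
    """Share of successful academics who followed the strategy."""
    return c[(True, True)] / (c[(True, True)] + c[(False, True)])


def p_success_given_strategy(c):
    """Share of strategy followers who became academics."""
    return c[(True, True)] / (c[(True, True)] + c[(True, False)])


def p_success_without_strategy(c):
    """Share of non-followers who became academics."""
    return c[(False, True)] / (c[(False, True)] + c[(False, False)])


print(p_strategy_given_success(counts))
print(p_success_given_strategy(counts))
print(p_success_without_strategy(counts))
```

With these made-up numbers, 80 percent of the successful academics followed the strategy, yet followers and non-followers succeed at the same 20 percent rate. Only the first figure can be estimated from the advice of successful academics alone.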

Here are some quotations from it that I enjoyed:

  • “A serious mistake has been made in classifying statistics as part of the mathematical sciences. Rather it should be regarded as a catalyst to scientific method itself.” (Kindle Locations 545-546).
  • “I forget whom I lied to (I expect it was the Army— they were used to it), but I did get my discharge.” (Kindle Locations 930-931).
  • “Likelihood methods are like a very intelligent but nondiscriminating child.” (Kindle Location 2024).
  • “None of this is a hanging matter.” (Kindle Location 2029).
  • “Originality and wit are very close.” (Kindle Location 2340).

The main weakness of the book is its meandering style. Box often moves from anecdote to anecdote in a train-of-thought style, and the logic of the transitions is unclear to the reader. This becomes less irritating by the second half of the book, either because it received better editing or because I grew used to the style.

Overall, I recommend the book to several audiences: grad students in any quantitative field, practicing statisticians, and those who would like to know more about the personal life of this influential figure.

Managing Memory and Load Times in R and Python

Once I know I won’t need a file again, it’s gone. (Regular back-ups with Time Machine have saved me from my own excessive zeal at least once.) Similar economy applies to runtime: My primary computing device is my laptop, and I’m often too lazy to fire up a cloud instance unless the job would take more than a day.

Working with GDELT data for the last few weeks I’ve had to be a bit less conservative than usual. Habits are hard to break, though, so I found myself looking for a way to

  1. keep all the data on my hard-drive, and
  2. read it into memory quickly in R and/or Python.

The .zip files you can obtain from the GDELT site accomplish (1) but not (2). A .rda binary helps with part of (2) but has the downside of being a binary file that I might not be able to open at some indeterminate point in the future–violating (1). And a memory-hogging CSV that also loads slowly is the worst option of all.

So what satisficing solution did I reach? Saving gzipped files (.gz). Both R and Python can read these files directly (R code shown below; in Python use gzip.open or the compression option for read_csv in pandas). It’s definitely smaller–the 1979 GDELT historical backfile compresses from 115.3MB to 14.3MB (an eighth of its former size). Reading directly into R from a .gz file has been available since at least version 2.10.

Is it faster? See for yourself:

> system.time(read.csv('1979.csv', sep='\t', header=F, flush=T, as.is=T) )
user system elapsed
48.930 1.126 50.918
> system.time(read.csv('1979.csv.gz', sep='\t', header=F, flush=T, as.is=T))
user system elapsed
23.202 0.849 24.064
> system.time(load('gd1979.rda'))
user system elapsed
5.939 0.182 7.577

Compressing and decompressing .gz files is straightforward too. In the OS X Terminal, just type gzip filename or gunzip filename, respectively.

Reading the gzipped file takes less than half as long as the unzipped version. It’s still nowhere near as fast as loading the rda binary, but I don’t have to worry about file readability for many years to come given the popularity of *nix operating systems. Consider using .gz files for easy memory management and quick loading in R and Python.
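For the Python side, here is a minimal sketch using only the standard library's gzip and csv modules. The file name and the three-column rows are stand-ins invented so the example runs on its own; they are not the real GDELT schema, and read_gzipped_tsv is a hypothetical helper name.

```python
import csv
import gzip
import os
import tempfile


def read_gzipped_tsv(path):
    """Read a gzip-compressed, tab-delimited file (no header) into a list of rows."""
    # Mode "rt" decompresses on the fly and yields text rather than bytes.
    with gzip.open(path, "rt", newline="", encoding="utf-8") as handle:
        return list(csv.reader(handle, delimiter="\t"))


# Write a tiny stand-in file so the sketch does not need the real data.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.csv.gz")
demo_rows = [["19790101", "USA", "3"], ["19790102", "FRA", "5"]]
with gzip.open(demo_path, "wt", newline="", encoding="utf-8") as handle:
    csv.writer(handle, delimiter="\t").writerows(demo_rows)

rows = read_gzipped_tsv(demo_path)
print(len(rows), rows[0])
```

If pandas is installed, pandas.read_csv(path, sep="\t", header=None, compression="gzip") should do the same job in one call and return a DataFrame instead of a list of lists.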

Statistics as Principled Argument

That’s the title of a book I recently came across by the late Robert P. Abelson. The thesis of the book is that statistics is a tool for organizing an argument. Abelson’s focus is his own discipline of psychology but many of his points apply to social science more broadly.

Throughout the book Abelson accumulates a list of his “laws”:

  1. Chance is lumpy.
  2. Overconfidence abhors uncertainty.
  3. Never flout a convention just once.
  4. Don’t talk Greek if you don’t know the English translation.
  5. If you have nothing to say, don’t say anything.
  6. There is no free hunch.
  7. You can’t see the dust if you don’t move the couch.
  8. Criticism is the mother of methodology.

My main gripe with the book is how much of it hinges on frequentist hypothesis testing. For example, I don’t consider the difference between a p-value of .05 and one of .07 to be a “principled argument.” Abelson does give some attention to Bayesian methods, but a book developing the idea of statistics as rhetoric from a Bayesian point of view would be more coherent. Perhaps we will see something along these lines from Andrew Gelman’s work on ethical statistics.

Dissertations as Essays Rather than Treatises

The essay format is increasingly popular in economics, according to a new paper by Wendy Stock and John Siegfried in the American Economic Review (gated, ungated). They find that “most of the evidence suggests that essay-style dissertations enhance economists’ early career research productivity.”

Here are some other trends they identify:

  • Economics dissertations in the form of essays rose from 0.3 percent of the total in 1970 to 69 percent in 2010
  • Economists who take an academic position are more likely to have written a dissertation consisting of essays (it would be interesting to see this conditional probability reversed)
  • Students in higher-ranking programs, in the micro-economics subfield, and from outside of the US adopted this strategy earlier than others

I am grateful that my own department permits the multiple-essay format. Although I have not submitted a dissertation prospectus yet, I anticipate that I will go this route myself.

[via Organizations and Markets]