About a month ago, Joshua Tucker posted some hypotheses about the number of tweets and likes that posts get on The Monkey Cage. Anton Strezhnev took up his question, building a screen scraper in Python and making all of his data public. His tentative conclusion was that posts containing graphics are more likely to be "Liked" than tweeted.

Coincidentally, Josh Cutler is teaching a course on Python for the Duke Political Science Department this semester, and one of our assignments was to build a blog scraper.* I took Anton's scraper as a starting point and built three more, to get data from Andrew Gelman's blog, Freakonomics, and Modeled Behavior. The idea behind these choices was to make comparisons between economics and political science blogs, and to have gradations of "wonkiness," another of the proposed hypotheses. Although it's pretty hard to escape wonkiness entirely in the academic blogosphere, here's how I see the categorization:

Here's what you can expect from this series (not necessarily in this order):

  • How do comments/tweets/likes correlate with page views?
  • How do comments predict (correlate with) tweets and likes?
  • What other factors predict tweets and likes? (post length, images, time since previous post)
  • What predicts comments? (same potential explanations)
  • Are there author- or category-specific factors on the blogs?

My goal for this series, besides just answering the questions, is to demonstrate the process of research to a broader audience. My own research process is decidedly imperfect, but by making everything--data collection scripts, data files, and R code for analysis--public, I hope to let you see where judgment calls were made, or where mistakes might have crept in. If you have questions about the process, or criticism of my work, please comment on the posts or shoot me an email. Look for some preliminary charts later today, and real analysis to begin over the weekend/next week.


Note: I'm not sure how the term "scraper" emerged, but it refers to a script that collects information from websites without doing any permanent damage to the website. Unless you forget to put in a time delay and crash the blog--but I'm not naming any names.
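
For the curious, here is a minimal sketch of what that time delay might look like in Python. It is not Anton's scraper or any of mine; the function names (fetch_page, scrape_politely) and the two-second default are assumptions chosen purely for illustration, and a real scraper would go on to parse the HTML for tweet, like, and comment counts.

```python
import time
import urllib.request
from typing import Callable, Dict, Iterable


def fetch_page(url: str, timeout: float = 10.0) -> str:
    """Download one page and return its HTML as text."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        # Fall back to UTF-8 when the server does not declare a charset.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def scrape_politely(
    urls: Iterable[str],
    fetch: Callable[[str], str] = fetch_page,
    delay_seconds: float = 2.0,  # assumed value; tune it for the site
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, str]:
    """Fetch each URL in turn, pausing between consecutive requests."""
    pages: Dict[str, str] = {}
    for i, url in enumerate(urls):
        if i > 0:
            # The pause that keeps the scraper from hammering the blog.
            sleep(delay_seconds)
        pages[url] = fetch(url)
    return pages
```

Calling scrape_politely with a list of post URLs returns a dictionary mapping each URL to its raw HTML. The fetch and sleep arguments can be swapped out, which makes it possible to try the loop on fake pages without touching a live blog at all.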