Schneier on Data and Power

Data and Power is the tentative title of a new book, forthcoming from Bruce Schneier. Here’s more from the post describing the topic of the book:

Corporations are collecting vast dossiers on our activities on- and off-line — initially to personalize marketing efforts, but increasingly to control their customer relationships. Governments are using surveillance, censorship, and propaganda — both to protect us from harm and to protect their own power. Distributed groups — socially motivated hackers, political dissidents, criminals, communities of interest — are using the Internet to both organize and effect change. And we as individuals are becoming both more powerful and less powerful. We can’t evade surveillance, but we can post videos of police atrocities online, bypassing censors and informing the world. How long we’ll still have those capabilities is unclear….

There’s a fundamental trade-off we need to make as society. Our data is enormously valuable in aggregate, yet it’s incredibly personal. The powerful will continue to demand aggregate data, yet we have to protect its intimate details. Balancing those two conflicting values is difficult, whether it’s medical data, location data, Internet search data, or telephone metadata. But balancing them is what society needs to do, and is almost certainly the fundamental issue of the Information Age.

There’s more at the link, including several other potential titles. The topic will likely interest many readers of this blog. It will likely build on his ideas of inequality and online feudalism, discussed here.

Visualizing the Indian Buffet Process with Shiny

(This is a somewhat more technical post than usual. If you just want the gist, skip to the visualization.)

N customers enter an Indian buffet restaurant, one after another. It has a seemingly endless array of dishes. The first customer fills her plate with a Poisson(α) number of dishes. Each successive customer i tastes each previously sampled dish with probability equal to its popularity: m_k/i, where m_k is the number of previous customers who have sampled the kth dish. The ith customer then samples a Poisson(α/i) number of new dishes.

That’s the basic idea behind the Indian Buffet Process (IBP). On Monday Eli Bingham and I gave a presentation on the IBP in our machine learning seminar at Duke, taught by Katherine Heller. The IBP is used in Bayesian non-parametrics to put a prior on (exchangeability classes of) binary matrices. The matrices usually represent the presence of features (“dishes” above, or the columns of the matrix) in objects (“customers,” or the rows of the matrix). The culinary metaphor is used by analogy to the Chinese Restaurant Process.
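
In symbols, the buffet scheme described above can be sketched as follows. This follows the standard formulation of the IBP; the notation z_{ik} (equal to 1 when customer i samples dish k), K_i^{new} (the number of new dishes customer i tries), and K_+ (the total number of dishes sampled by all N customers) is introduced here for illustration and does not appear elsewhere in this post.

```latex
\begin{align*}
P(z_{ik} = 1 \mid m_k) &= \frac{m_k}{i}, \\
K_i^{\text{new}} &\sim \mathrm{Poisson}\!\left(\frac{\alpha}{i}\right), \\
\mathbb{E}[K_+] &= \alpha \sum_{i=1}^{N} \frac{1}{i}.
\end{align*}
```

For example, with α = 10 and N = 10 the harmonic sum is roughly 2.93, so a realization should have about 29 columns on average.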

Although the visualizations in the main paper summarizing the IBP are good, I thought it would be helpful to have an interactive visualization where you could change α and N to see what a random matrix with those parameters looks like. For this I used Shiny, although it would also be fun to do in d3.

One realization of the IBP, with α=10.

In the example above, the first customer (top row) sampled seven dishes. The second customer sampled four of those seven dishes, and then four more dishes that the first customer did not try. The process continues for all 10 customers. (Note that this matrix is not sorted into its left-ordered form. The app also sometimes gives an error if α << N, but I wanted users to be able to choose arbitrary values of N so I have not changed this yet.) You can play with the visualization yourself here.

Interactive online visualizations like this can be a helpful teaching tool, and the process of making them can also improve your own understanding of the process. If you would like to make another visualization of the IBP (or another machine learning tool that lends itself to graphical representation) I would be happy to share it here. I plan to add the Chinese restaurant process and a Dirichlet process mixture of Gaussians soon. You can find more about creating Shiny apps here.

A Chrome Extension for XKCD Substitutions

This morning’s XKCD had some fun suggestions for replacing key phrases to make news articles more fun:

Regular readers may recall my Doublespeak Chrome extension, which works on the same principle. In short order, I was able to create a new app, XKCDSub, that works the same way: install the extension, and when you click its icon it will open your current page in a new tab with the phrases replaced. Here is an example of the extension in action on Elon Musk’s Wikipedia page:


The code is open source on Github. You can find it in the Chrome webstore here.

Visualizing Political Unrest in Egypt, Syria, and Turkey

The lab of Michael D. Ward et al now has a blog. The inaugural post describes some of the lab’s ongoing projects that may come up in future entries including modeling of protests, insurgencies, and rebellions, event prediction (such as IED explosions), and machine learning techniques.

The second post compares two event data sets–GDELT and ICEWS–using recent political unrest in the Middle East as a focal point (more here):

We looked at protest events in Egypt and Turkey in 2011 and 2012 for both data sets, and we also looked at fighting in Syria over the same period…. What did we learn from these limited comparisons? First, we found out first hand what the GDELT community has been saying: the GDELT data are in BETA and currently have a lot of false positives. This is not optimal for a decision making aid such as ICEWS, in which drill-down to the specific events resulting in new predictions is a requirement. Second, no one has a good ground truth for event data – though we have some ideas on this and are working on a study to implement them. Third, geolocation is a boon. GDELT seems especially good at this, even with a lot of false positives.

The visualization, which I worked on as part of the lab, can be found here. It relies on CartoDB to serve data from GDELT and ICEWS, with some preprocessing done using MySQL and R. The front-end is JavaScript using a combination of d3 for timelines and Torque for maps.


GDELT (green) and ICEWS (blue) records of protests in Egypt and Turkey and conflict in Syria

If you have questions about the visualizations or the technology behind them, feel free to mention them here or on the lab blog.

Review: RubyMotion iOS Development Essentials

RubyMotion is a continued topic of interest on this blog, and I will likely have more posts about it in the near future. At this stage I am still getting comfortable with iOS development, but I would much rather be doing it in the friendly playground of Ruby than in the Objective-C world. In addition to the RubyMotion book from PragProg, the next resource I would recommend is RubyMotion iOS Development Essentials.*

The book takes a “zero-to-deploy” approach, which is great for beginners trying to get their first app into the App Store. The first few chapters will be redundant for developers who have worked with RubyMotion before, but they provide a helpful introduction to how RM works and the Model-View-Controller paradigm.

For several chapters the book uses the same example application, a restaurant recommender reminiscent of Yelp. Demonstrating code by building up from a simple application is a nice way of presenting the material. By the time readers have worked through these chapters they will have an example app that is more interesting than many of the toy apps in shorter tutorials.

Later chapters will benefit novice and experienced developers alike, because they fill a gap in the RubyMotion literature. Many tutorials overlook the process of testing RM code, and testing iOS in general can be challenging. The testing chapter of this book goes over unit testing, functional testing, and tests that rely on device events such as gestures.

My favorite chapter in the book was chapter 6, which goes over device capabilities. At 46 pages this is the longest chapter in the book, covering GPS, gestures, Core Data, using the Address Book, and more. I especially enjoyed working through the section on accessing the camera and Photo Library. This is difficult to test on the simulator since there is (obviously) no access to a built-in camera (as with certain iOS devices including some iPod Touch models), but the example app covers how to handle this gracefully.

Stylistically, it can be a challenge to lay out a book that uses iOS API jargon like UIImagePickerControllerSourceTypePhotoLibrary. There were some gripes with the authors’ choice of two-space indenting, but that is my preference so it did not bother me. One addition I would have preferred would be additional formatting for the code, using either colors (for the e-book) or bolding (for the print version) to distinguish function names and help the reader keep their place in the code. The apps themselves rely mainly on iOS defaults. This is common in tutorials, but it also helps them look natural in iOS 7. Most of the time I was working through the book I used the iOS 6.1 simulator, but it was no problem to upgrade to iOS 7.

As a whole this book is a thorough introduction to RubyMotion development. It has several key features that are missing from other RubyMotion tutorials, including an emphasis on testing code. This book makes a great resource for new RubyMotion developers, or developers who want to use more of the device capabilities.

*Note: For this review I relied on the e-book version, which was provided as a review copy by Packt Publishing.

Technology and Government: San Francisco vs. New York

In a recent PandoMonthly interview, John Borthwick made a very interesting point. Many cities are trying to copy the success of Silicon Valley/Bay Area startups by being like San Francisco: hip, fun urban areas designed to attract young entrepreneurs and developers (Austin comes to mind). However, the relationship between tech and other residents is a strained one: witness graffiti to the effect of “trendy Google professionals raise housing prices” and the “startup douchebag” caricature.

New York, on the other hand, has a smaller startup culture (“Silicon Alley”) but much closer and more fruitful ties between tech entrepreneurs and city government. Mayor Bloomberg has been at the heart of this, with his Advisory Council on Technology and his 2012 resolution to learn to code. Bloomberg’s understanding of technology and relationship with movers and shakers in the industry will make him a tough act to follow.

Does this mean that the mayors of Chicago, Houston, or Miami need to be writing Javascript in their spare time? Of course not. But making an effort to understand and relate to technology professionals could yield great benefits.

Rather than trying to become the next Silicon Valley (a very tall order) it would be more efficacious for cities to follow New York’s model: ask not what your city can do for technology, but what technology can do for your city. Turn bus schedule PDF’s into a user-friendly app or–better yet, for many low-income riders–a service that allows you to text and see when the next bus will arrive. Instead of calling the city to set up services like water and garbage collection, add a form to the city’s website. The opportunities to make city life better for all citizens–not just developers and entrepreneurs–are practically boundless.

I was happy to see San Francisco take a small step in the right direction recently with the Open Law Initiative, but there is more to be done, and not just in the Bay Area. Major cities across the US and around the world could benefit from the New York model. See more of the Borthwick interview below:

Bash Script for Editing Playlist Files

Over the weekend I was working on a playlist for a personal event coming up later this week. The playlist had about 5 hours of music–around 80 songs–that had been purchased from various sources over the years. In order to have a backup of the playlist on another machine I needed to get all of the music files in one place, so I used iTunes’ “Create AAC Version” tool.

The problem with this was that many of the new files were named in the format “#{song_number} #{song_title}.m4a”. For instance, “Dustland Fairytale” by The Killers was “05 Dustland Fairytale.m4a.” I could’ve spent my Saturday night manually clicking through and editing the filenames, but fortunately I knew that with a little bash scripting I could automate the whole process.

Rather than show the entire script right away, I want to go through the process of composing a bash script to solve this type of problem. First, we know that we can rename files using the mv command:

mv oldfilename newfilename

Next, it’s important to know that you can loop through all .m4a files (or whatever other extension) in a given directory by:

for i in *.m4a; do
  [do stuff here]
done

Within the for loop we access the filename by "${i}". Your code inside the do block could be something simple like echo "${i}" or something more complicated spanning multiple lines. We can also take substrings of the filename in the form "${string:offset:length}", where offset is the zero-based starting position and length is the number of characters to keep. If we leave off the length, the substring runs to the end of the string. (You can also index from right to left, but we omit that for simplicity here.) All of the numbers I was dealing with in the playlist files were two digits, with a space separating them from the song title. So basically I wanted to drop the first three characters of the string (indices 0, 1, and 2). We can print out the shortened filenames by:

for i in *.m4a; do
  echo "${i:3}"
done
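
To see what the substring syntax in that loop does on its own, here is a small sketch applied to a single string. The variable name is arbitrary, and the value is the example filename from earlier in this post.

```shell
#!/usr/bin/env bash
# One example filename; the variable name "name" is arbitrary.
name="05 Dustland Fairytale.m4a"

echo "${name:0:2}"   # offset 0, length 2: the track number
echo "${name:3}"     # offset 3, no length: everything after the number and space
echo "${name:3:8}"   # offset 3, length 8: the first word of the title
```

The second number is a length rather than an end position, which is easy to mix up if you are used to slicing in other languages.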

But if you run that loop and some of your files don’t start with numbers, you will see that it chops off the beginnings of those filenames. To avoid this, we need to use an if-then statement to check whether the files begin with a number. For my files it was sufficient to check whether the first character of the filename is “0”; track numbers of 10 or higher would need a broader check:

for i in *.m4a; do
	if [ "${i:0:1}" == "0" ]; then
		echo "${i:3}"
	fi
done

We just print the modified filenames in order to check whether our script operates as intended. This is an important caveat to bash scripting–check whether your script does what you want before you run it. Bash scripts are like sharp blades: in the hands of a master they are a wonderful tool, but in the hands of an amateur they can be deadly. I’m closer to the amateur end of the spectrum so I prefer to be careful. Once we are satisfied that our little tool is only chopping off what we want it to, then we are ready to compose the whole script with that mv command:

for i in *.m4a; do
	if [ "${i:0:1}" == "0" ]; then
		mv "${i}" "${i:3}"
	fi
done

If you put this in a file in the same directory as your playlist, then you can run it from your terminal with bash followed by the file’s name.
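
For comparison, here is a slightly more defensive sketch of the same idea. It is not the script I actually ran: the function name strip_track_numbers and the --dry-run flag are made up for this illustration. It assumes every numbered file starts with exactly two digits and a space, matches any such prefix rather than only a leading 0, and refuses to overwrite existing files.

```shell
#!/usr/bin/env bash
# Sketch only: strip_track_numbers and --dry-run are hypothetical names.
# Usage: strip_track_numbers DIR [--dry-run]
strip_track_numbers() {
  local dir=$1 mode=${2:-} f base new
  for f in "$dir"/*.m4a; do
    [ -e "$f" ] || continue              # the glob matched nothing
    base=${f##*/}                        # filename without the directory
    # Only touch names that begin with two digits and a space.
    if [[ $base == [0-9][0-9]" "* ]]; then
      new=${base:3}
      if [ "$mode" = "--dry-run" ]; then
        echo "$base -> $new"             # show the rename without doing it
      else
        mv -n -- "$f" "$dir/$new"        # -n: never overwrite an existing file
      fi
    fi
  done
}
```

Running it once with --dry-run plays the same role as the echo step above. One caveat: a song whose real title begins with two digits and a space would lose them too, so the dry run is worth reading before the real thing.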

OK, so that may not have saved me that much time but it was way more fun than clicking through all those files!


Note: There are probably many other ways to accomplish the same outcome described in this post. This may include other music file managers, other export methods from iTunes, or even handy bash one-liners. The point was not to give an optimal method for organizing playlists but to show the process by which a bash script evolves to solve a simple one-off task. 

The Internet: Communication or Transportation?

The world we live in today is made of computers. We don’t have cars any more, we have computers we ride in. – Cory Doctorow (transcript)

Is the Internet a communication technology or a transportation technology? What does the answer to this question imply about Internet governance and the future of online liberty?

One thing technology does well is take multiple functions that were previously bound into the same physical process or object and split them into separate objects/subroutines, each of which does its own job so efficiently that the overall object/process works better than it did before. These chunks can also be recombined in new ways to do things that were not previously feasible.

An example is ebooks. Previously the storage, display, and transportation functions of a book were all combined into a single physical unit. The display of one book (its pages and ink) could be repurposed into another only by cutting it up, ransom-note style, or through a lengthy process of recycling. The display was also inseparable from the storage: if the display got wet, the data was marred forever. Transporting the information in the book could only be done by moving its entire bundle of atoms from one place to another.

Enter the ebook. A single display can be used for a virtually infinite number of books. Storage is extensible, expandable, and expendable. If you want more, get it. If it breaks, replace it. And when you are ready to add a new edition to your collection it only takes a matter of seconds to transfer the bits.

Actually, the process goes back much further to when the written word disembodied message from messenger. Before this, shooting the messenger was the only primitive backspace key available. Burning books Fahrenheit 451-style can be tragic, but it is quite an improvement over burning bodies.

Is the Internet a simple continuation of this separation-optimization-recombination trend, or is it something more? The Internet is more similar to the spoken/written word jump than it is to the printed book/ebook development, because it allows the separation of consciousness from body. My body can be in almost any physical location while my consciousness is bound up in a conversation, collaborative project, or game with almost anyone else from almost anywhere else.

In this way, the Internet is more like a transportation technology than it is a communications technology. Governing the roads was a nontrivial task for the early modern state. Then came air travel, which existed for a brief unregulated period before governments learned to exercise their control there. For more on the tension between innovation and regulation in transportation, see here, here, and here.

These early periods are open to rapid innovation, which also means that they permit risk-taking. This risk/opportunity trade-off is the one chosen by state-avoidant peoples. States and their peoples see the opportunity but do not want the risk. Risk can be reduced or it can be hidden; the latter is cheaper and states are better at it, so it is often on that margin that they work to bring their peoples into new avenues of opportunity without fear. But by reducing the downside risk they also take away the upside of innovation.

The Internet is nearing this inflection point, if it has not already passed. It is a dangerous but promising frontier. Would you rather have pioneers as your guide, or big brother watching out for you?

Doublespeak: A Chrome App for the Orwellian Web

tl;dr: Doublespeak is a new Chrome web extension that replaces political doublespeak with plain English. It’s open source so you can help expand the dictionary of terms. 

George Orwell is well-known for introducing the terms “newspeak” and “doublethink” in his novel 1984. A portmanteau of the two, doublespeak, is more common in our modern lexicon–and unfortunately, so is the practice that it names. Another of Orwell’s works, “Politics and the English Language,” explains doublespeak using examples that seem almost quaint today (1946):

In our time, political speech and writing are largely the defense of the indefensible. Things like the continuance of British rule in India, the Russian purges and deportations, the dropping of the atom bombs on Japan, can indeed be defended, but only by arguments which are too brutal for most people to face, and which do not square with the professed aims of the political parties. Thus political language has to consist largely of euphemism, question-begging and sheer cloudy vagueness. Defenseless villages are bombarded from the air, the inhabitants driven out into the countryside, the cattle machine-gunned, the huts set on fire with incendiary bullets: this is called pacification.

Although Orwell is gone, the problems he describes are not. If anything, doublespeak has gotten worse in this age of “rendition,” TSA security theater, and PRISM.

Tim Lynch addressed this problem in the context of the War on Terror in 2006:

By corrupting the language, the people who wield power are able to fool the others about their activities and evade responsibility and accountability. Professor William Lutz, author of The New Doublespeak, notes: “Doublespeak is language that pretends to communicate but really doesn’t. It is language that makes the bad seem good, the negative appear positive, the unpleasant appear attractive or at least tolerable. Doublespeak is language that avoids or shifts responsibility, language that is at variance with its real or purported meaning. It is language that conceals or prevents thought; rather than extending thought, doublespeak limits it.”

It is true, of course, that dishonesty has always been a part of the human experience, but doublespeak is a pernicious variation of dishonesty. Doublespeak perverts the basic function of language, which is to facilitate a common understanding between human beings.

Lynch goes on to list several examples: “stop-loss” orders as a stand-in for conscription, the replacement of warrants by “national security letters,” and the renaming of Guantanamo prisoner suicides as “asymmetrical warfare.”

A–perhaps the–key point of Orwell’s conception of doublespeak is that words have meaning. Although this runs counter to postmodernism, it points out that language is a key front in the battle for ‘hearts and minds.’ Witness the recent discussion between a well-spoken University of Wisconsin student (‘Madiha’) and an on-campus recruiter for the NSA:

NSA RECRUITER 1: I’m focusing on what our foreign intelligence requires of [us], so…you can define ‘adversary’ as [an] enemy and clearly, Germany is not our enemy, but would we have foreign national interest from an intelligence perspective on what’s going on across the globe? Yes, we [would].

MADIHA: So by “adversary”, you actually mean anybody and everybody. There’s nobody, then – by your definition – that is not an adversary. Is that correct?

NSA RECRUITER 1: That is not correct.

Doublespeak has the power of the state behind it, which includes a great deal of technological sophistication. Until recently, I was more optimistic about the power of the internet to oppose conventional sources of political power. Although the recent Snowden revelations have diminished my confidence in technology as a political force, we can still use it as a tool to take back language.

To that end, I have developed a simple tool that you can use to counter doublespeak in your web viewing experience. It is known as Doublespeak and is available as a Chrome web extension. Right now it has a small dictionary of three terms that it replaces, but it can easily be extended with more. The code is also open source on GitHub. When you install the extension, clicking its icon in the browser window will open a duplicate of the current page in a new tab, but with doublespeak terms replaced by their plain English equivalents.

Here are a few examples of the Doublespeak extension at work on these three pages:

Obviously it does not replace the text in images, but I think that makes the last example all the more striking. The extension should respond to titleized words, but some other special cases (e.g. all uppercase) are not handled in the current version (0.1).
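
As a rough illustration of the replacement logic (not the extension’s actual code, which is JavaScript), here is a minimal bash sketch. The two term pairs are illustrative only, loosely based on examples quoted earlier in this post; they are not the extension’s real dictionary, and the function names despeak and titleize are made up for this example.

```shell
#!/usr/bin/env bash
# Sketch only: illustrative term pairs, not the extension's dictionary.
terms=("stop-loss order" "pacification")
plain=("conscription order" "bombardment")

# Upper-case the first character of a string.
titleize() {
  local first rest
  first=$(printf '%s' "${1:0:1}" | tr '[:lower:]' '[:upper:]')
  rest=${1:1}
  printf '%s%s' "$first" "$rest"
}

# Replace each term in its lower-case and titleized forms.
despeak() {
  local text=$1 i from to from_title to_title
  for i in "${!terms[@]}"; do
    from=${terms[i]}
    to=${plain[i]}
    from_title=$(titleize "$from")
    to_title=$(titleize "$to")
    text=${text//"$from"/"$to"}
    text=${text//"$from_title"/"$to_title"}
  done
  printf '%s\n' "$text"
}

despeak "The stop-loss order was signed. Pacification followed."
```

Like version 0.1 of the extension, this sketch handles lower-case and titleized terms but leaves all-uppercase text untouched.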

If you have suggestions for new additions to the dictionary or other features, please let me know.

Reputation in Hacker Culture

I have long wanted to do a project on reputation in hacker culture. As I have delved into this further (and I still enjoy reading about it), it turns out Eric Raymond said it better than I could, nearly 20 years ago:

Like most cultures without a money economy, hackerdom runs on reputation. You’re trying to solve interesting problems, but how interesting they are, and whether your solutions are really good, is something that only your technical peers or superiors are normally equipped to judge.

Accordingly, when you play the hacker game, you learn to keep score primarily by what other hackers think of your skill (this is why you aren’t really a hacker until other hackers consistently call you one). This fact is obscured by the image of hacking as solitary work; also by a hacker-cultural taboo (now gradually decaying but still potent) against admitting that ego or external validation are involved in one’s motivation at all.

Specifically, hackerdom is what anthropologists call a gift culture. You gain status and reputation in it not by dominating other people, nor by being beautiful, nor by having things other people want, but rather by giving things away. Specifically, by giving away your time, your creativity, and the results of your skill.