Can you tell the difference between puppies and bagels? A labradoodle or fried chicken? Chihuahuas or muffins? Last week a fun meme from @teenybiscuit raised these questions, with some examples that were difficult to distinguish. It got me wondering how well computer vision methods could tell these apart.
I initially considered building a classifier with TensorFlow, but with so few examples in each class it is unlikely that we would get very good results. Instead, I used Google’s Cloud Vision API, which has been trained on thousands (if not millions) of images and can provide human-readable labels.
To interact with this API you will need a Google Cloud account. Once you have created your account, start a new project and enable the Cloud Vision API. Then, download your API key from the console and you are ready to go.
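To give a sense of what the key is used for, here is a minimal sketch of a label request sent straight to the public REST endpoint using only Ruby's standard library. It is not the gem-based client used in this post. The helper names (`build_label_request`, `fetch_labels`), the `GOOGLE_API_KEY` environment variable, and the image path are assumptions made for illustration.

```ruby
require "json"
require "net/http"
require "uri"

# Public REST endpoint for Cloud Vision image annotation.
ANNOTATE_URL = "https://vision.googleapis.com/v1/images:annotate"

# Build the JSON-ready body for a single image (hypothetical helper).
# max_results is an illustrative default, not a value from the post.
def build_label_request(image_bytes, max_results: 10)
  {
    requests: [
      {
        # pack("m0") is core Ruby's strict Base64 encoding (no line breaks).
        image: { content: [image_bytes].pack("m0") },
        features: [{ type: "LABEL_DETECTION", maxResults: max_results }]
      }
    ]
  }
end

# Send the request and return the label descriptions (hypothetical helper).
def fetch_labels(image_path, api_key)
  uri = URI(ANNOTATE_URL)
  uri.query = URI.encode_www_form(key: api_key)
  body = JSON.generate(build_label_request(File.binread(image_path)))
  response = Net::HTTP.post(uri, body, "Content-Type" => "application/json")
  annotations = JSON.parse(response.body).dig("responses", 0, "labelAnnotations") || []
  annotations.map { |annotation| annotation["description"] }
end

# Example usage (needs a real key and image, so it is left commented out):
# puts fetch_labels("images/puppy.jpg", ENV.fetch("GOOGLE_API_KEY"))
```

Keeping the request-building step separate from the network call makes the payload easy to inspect before spending any API quota.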
I based my example on the latest version of the google-api-ruby-client gem (0.9.4), which supports the Cloud Vision API. The Ruby gem’s documentation is a bit lacking, but I was able to figure out the format of the requests by using the API docs and spelunking through the code. My DogFoodClient retrieves labels for each image and counts how many food- or dog-related labels were applied to it.
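The counting step can be sketched roughly as follows. This is not the actual DogFoodClient code: the keyword lists and the `count_matches` and `tally` helpers are assumptions, and the sample labels are ones that came back for the images discussed in this post.

```ruby
# Illustrative keyword lists; the real client may use different terms.
DOG_WORDS = %w[dog puppy].freeze
FOOD_WORDS = %w[food baked bread dessert meal dish].freeze

# Count the labels that contain at least one of the given keywords.
def count_matches(labels, words)
  labels.count do |label|
    words.any? { |word| label.downcase.include?(word) }
  end
end

# Summarize one image's labels as dog-related and food-related counts.
def tally(labels)
  {
    dog: count_matches(labels, DOG_WORDS),
    food: count_matches(labels, FOOD_WORDS)
  }
end

muffin_labels = ["dog breed", "dog breed group", "baked goods", "food"]
p tally(muffin_labels)
```

Substring matching is deliberately loose so that “fried food” and “dog breed group” both count. A stricter approach would compare whole words only.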
|Image Type|Dog Labels|Food Labels|
As you can see, the labeling system works remarkably well. All of the labradoodle/fried chicken examples are classified correctly. Every chicken example included the term “fried food” as well as the more generic “food.” Chihuahuas and muffins were distinguished almost as well–only one muffin had the labels “dog breed” and “dog breed group” applied to it:
The trickiest group was the puppies that looked like bagels–three of the puppies were mislabeled. This one came back with the labels “food” and “kue” (an Indonesian dessert, I learned):
I cannot quite figure out why this puppy was labeled “dish” and “food.” Maybe because it is seated on a piece of furniture with a floral pattern, like some decorative plates?
Another example came back with the labels “baked goods”, “food”, “dessert”, “meal”, and “bread”–enough to make me wonder if I had messed up when splitting up the images into subdirectories. Instead, it appears that the spirals on the image (probably a screenshot from a stock photo site) caused the confusion:
Overall I was very impressed with the performance of the Cloud Vision API. These examples are amusing precisely because they are confusing to the human eye at first glance. If you have burning questions you would like to explore with this code–such as “shiba or marshmallow?”–submit a pull request on GitHub.