This post is a round-up of some of the most important and influential deep learning papers for computer vision. It covers popular convolutional neural networks (CNNs), models designed for detecting and classifying objects, text detection models, and image segmentation. Taken together, these papers provide an overview of deep learning as applied to computer vision and serve as a useful introduction for practitioners new to the field.

Progress in computer vision began to accelerate rapidly with the introduction of deep learning techniques, especially convolutional neural networks (CNNs). Instead of relying on the hand-tuned features popular in classical computer vision, CNNs “learn” the features needed to complete the task at hand accurately. Many of the papers below are well known because of their success in the annual ImageNet image classification challenge.

  • LeNet Yann LeCun and his co-authors were the first to apply a modern CNN architecture to computer vision. The model was designed to recognize handwritten characters, with the idea that it could be used to automatically process mail. More information about the model is available here.
  • AlexNet This paper introduced the first CNN architecture to win the ImageNet challenge (in 2012). Its main contribution is the idea of a “deep” CNN architecture with five convolutional layers and three fully-connected layers.
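  • GoogLeNet, aka Inception This model goes deeper than AlexNet by stacking “Inception” modules, which run convolutions of several sizes (1-by-1, 3-by-3, and 5-by-5) in parallel and use 1-by-1 convolutions for dimensionality reduction. It won the 2014 ImageNet challenge.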
  • VGG This model is much deeper than AlexNet, with a total of 16 weight layers, and does so using only 3-by-3 convolutions, max pooling, and fully connected layers (without the 5-by-5 convolutions in GoogLeNet).
  • ResNet One problem in training very deep models is the “vanishing gradient” problem: gradients shrink as they are propagated back through many layers, which prevents the early layers from learning. ResNets add “shortcut connections” that skip over layers to address this problem; the resulting network can be viewed as an ensemble of shallower networks.
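
The shortcut connection in ResNet is easy to show in miniature. The following is a minimal sketch rather than the paper's implementation: it assumes plain Python lists in place of tensors, and `zero_layer` and `scale_layer` are hypothetical stand-ins for learned layers.

```python
def residual_block(x, layer_fn):
    """Return layer_fn(x) + x element-wise; the '+ x' term is the shortcut."""
    fx = layer_fn(x)
    if len(fx) != len(x):
        raise ValueError("the shortcut needs matching input and output sizes")
    return [f + xi for f, xi in zip(fx, x)]


def zero_layer(x):
    # Hypothetical layer whose weights are all zero, so it outputs zeros.
    return [0.0 for _ in x]


def scale_layer(x):
    # Hypothetical layer that applies a small learned correction.
    return [0.1 * v for v in x]


x = [1.0, -2.0, 3.0]
# With a layer that contributes nothing, the block passes its input through.
print(residual_block(x, zero_layer))  # [1.0, -2.0, 3.0]
# Otherwise the layer only has to learn a correction on top of the input.
print(residual_block(x, scale_layer))
```

Because the input is always added back, a block that has learned nothing useful still behaves like the identity, which is part of why very deep stacks of such blocks remain trainable.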

Object Detection & Classification

The models in the previous section are primarily concerned with classification: given an image, what class(es) does it belong to? For that problem, the model always produces a fixed number of outputs: either a single scalar value representing the class, or a vector of scores indicating the likelihood of each class. Detecting objects in images is a more difficult problem, because we want the model to detect every instance of the object that appears, and we do not know a priori how many there are. This means that we need a model that supports variable-length output. The models in this section represent various approaches to the problem of object detection.
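
To make the fixed-size classification output concrete, here is a small sketch of turning a vector of raw class scores into per-class likelihoods with a softmax and then into a single predicted class. The class names and score values are made up for illustration; no particular model is assumed.

```python
import math


def softmax(scores):
    """Convert raw class scores into probabilities that sum to 1."""
    largest = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


class_names = ["cat", "dog", "car"]  # hypothetical label set
scores = [2.0, 1.0, 0.1]             # hypothetical raw model outputs

probs = softmax(scores)              # vector form: one likelihood per class
predicted = max(range(len(probs)), key=probs.__getitem__)  # scalar form

print(class_names[predicted])  # cat
```

However many objects appear in the image, the output here always has exactly one entry per class, which is the limitation that detection models have to work around.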

  • R-CNN This paper introduces the idea of region proposals: selecting candidate sub-regions of an image, evaluating each one with a CNN to generate features, and finally running a regression for bounding box coordinates and an SVM to classify whether or not an object is present. Essentially, this reframes detection as a classification problem.
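  • Fast R-CNN This model makes a very natural extension to R-CNN: run the CNN once over the whole image, then pool the shared features for each region of interest proposed by selective search, rather than running the CNN separately on every region.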
  • Faster R-CNN In the R-CNN series, this is the first model that would be considered a “modern” object detection model. A sub-network learns to propose regions of interest, which are then fed to the classifier.
  • YOLO One of the shortcomings of the R-CNN series of models is that the network does not look at the full image, only the regions that have a high probability of containing an object (per the region proposals). In YOLO (“You Only Look Once”), a single network predicts both bounding boxes (many per image) and a class probability map (gridded across the image). These predictions are then joined together to form the final detections, which can be detections of multiple object classes.
  • YOLOv2 This extension of YOLO uses the same idea of gridding up the image. The main change is that for each grid cell, five anchor boxes are evaluated for detecting objects of different aspect ratios.
  • SSD This model is quite similar to both YOLO and Faster R-CNN. The main difference from Faster R-CNN is that SSD produces a score for each class in every box (whereas Faster R-CNN uses a separate classifier). This model is faster than the original YOLO model.
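
The way YOLO joins its two kinds of predictions can be sketched in a few lines. This is a simplified illustration, not the paper's code: the data layout (one dict per grid cell), the `combine_predictions` name, and the 0.25 threshold are all assumptions, and the non-maximum suppression step that real pipelines apply afterwards is omitted.

```python
def combine_predictions(cells, threshold=0.25):
    """Join per-cell boxes with per-cell class probabilities.

    cells: list of dicts, one per grid cell, each with
      'boxes': list of (x, y, w, h, confidence) tuples
      'class_probs': list with one probability per class
    """
    detections = []
    for cell in cells:
        probs = cell["class_probs"]
        best_class = max(range(len(probs)), key=probs.__getitem__)
        for x, y, w, h, confidence in cell["boxes"]:
            # Class-specific score: box confidence times class probability.
            score = confidence * probs[best_class]
            if score >= threshold:
                detections.append(
                    {"box": (x, y, w, h), "class_id": best_class, "score": score}
                )
    return sorted(detections, key=lambda d: d["score"], reverse=True)


# Two hypothetical grid cells; coordinates are fractions of the image size.
cells = [
    {"boxes": [(0.3, 0.4, 0.2, 0.5, 0.9), (0.3, 0.4, 0.6, 0.2, 0.1)],
     "class_probs": [0.8, 0.2]},
    {"boxes": [(0.7, 0.7, 0.3, 0.3, 0.5)],
     "class_probs": [0.1, 0.9]},
]

for detection in combine_predictions(cells):
    print(detection["class_id"], detection["box"])
```

The low-confidence second box in the first cell is dropped, while the two remaining detections belong to different classes, so a single pass yields a variable number of detections across classes.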

Detection and Information Extraction

Detecting coherent objects such as cars and other vehicles, as the models in the previous section do, is one thing. They tend to be made up of fairly uniform parts that are straightforward to describe: a chassis, some wheels, and so on. Detecting text is a somewhat different problem with quite a bit of variation, because text is made up of potentially many words, which are themselves made up of characters, in one or more of a very large set of possible fonts and colors. Text also appears at many sizes and orientations against a variety of backgrounds. The models listed in this section are aimed at detecting such text in natural scenes.


All of the detectors listed so far provide bounding-box or polygon representations of the objects they find in images. Sometimes this is not detailed enough and you need to categorize everything in the image on a per-pixel basis. This amounts to classifying every pixel in an image, taking into account the neighboring classifications so that regions are coherent. That is where segmentation models come in.
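
As a rough illustration of per-pixel classification, the sketch below takes one score map per class and assigns each pixel the class with the highest score. The nested-list layout and the `label_pixels` name are assumptions, and this version deliberately ignores neighboring pixels; in a real segmentation model the neighborhood information is already baked into the scores.

```python
def label_pixels(score_maps):
    """Return a 2-D map of class indices.

    score_maps[c][row][col] is the score for class c at that pixel.
    """
    num_classes = len(score_maps)
    height = len(score_maps[0])
    width = len(score_maps[0][0])
    return [
        [
            max(range(num_classes), key=lambda c: score_maps[c][row][col])
            for col in range(width)
        ]
        for row in range(height)
    ]


# Hypothetical scores for a 2-by-3 image and two classes.
background = [[0.9, 0.8, 0.2],
              [0.7, 0.1, 0.3]]
foreground = [[0.1, 0.2, 0.8],
              [0.3, 0.9, 0.7]]

print(label_pixels([background, foreground]))  # [[0, 0, 1], [0, 1, 1]]
```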

  • Mask R-CNN As its name suggests, this model builds upon the R-CNN series of architectures. It uses the same region proposal and classifier setup to obtain an object detection, and then runs additional convolutional layers within the detection box to generate a binary mask of object presence or absence.
  • Fully Convolutional Networks for Semantic Segmentation Since segmentation produces per-pixel classifications, you would like to be able to run it on images of arbitrary size without the model being sensitive to features such as aspect ratio or resolution. The authors of this paper meet that criterion by using a fully convolutional architecture: a series of convolutional layers that feed into a pixelwise classifier. Information about neighboring pixels comes in through the convolutional features, which build on earlier models such as VGG and AlexNet.
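
The reason a fully convolutional architecture accepts images of any size is that a convolution only ever looks at a small window, so the same weights can slide over any grid. Below is a minimal pure-Python sketch of a single zero-padded convolution; the all-ones kernel is a hypothetical stand-in for learned weights, and a real network would stack many such layers.

```python
def conv2d_same(image, kernel):
    """Slide a square, odd-sized kernel over a 2-D image with zero padding.

    The output has the same height and width as the input.
    """
    k = len(kernel)
    pad = k // 2
    height, width = len(image), len(image[0])
    output = [[0.0] * width for _ in range(height)]
    for r in range(height):
        for c in range(width):
            total = 0.0
            for i in range(k):
                for j in range(k):
                    rr, cc = r + i - pad, c + j - pad
                    # Positions outside the image count as zero.
                    if 0 <= rr < height and 0 <= cc < width:
                        total += kernel[i][j] * image[rr][cc]
            output[r][c] = total
    return output


kernel = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]  # hypothetical weights

small = [[1, 1], [1, 1]]            # 2-by-2 input
wide = [[1] * 5 for _ in range(3)]  # 3-by-5 input

for image in (small, wide):
    result = conv2d_same(image, kernel)
    print(len(result), len(result[0]))  # output size follows the input size
```

A fully connected layer, by contrast, has one weight per input position, which is why classifiers that end in such layers are tied to a fixed input size.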


The papers above provide a solid overview of both the recent past and current developments in deep learning for computer vision. By understanding the related tasks of classification, object detection, and segmentation, you will see the widespread usefulness of convolutional neural networks. These techniques can also be combined into composite models, such as detecting regions of text and then passing them to an optical character recognition (OCR) model.
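
A composite model of this kind is mostly glue code. The sketch below shows the shape of such a pipeline under loud assumptions: `detect_text_regions` and `recognize_text` are hypothetical callables standing in for trained models, boxes are `(top, left, bottom, right)` with exclusive ends, and the "image" is a toy grid of characters so the example can run without any libraries.

```python
def crop(image, box):
    """Cut a (top, left, bottom, right) box out of a 2-D list of rows."""
    top, left, bottom, right = box
    return [row[left:right] for row in image[top:bottom]]


def read_text(image, detect_text_regions, recognize_text):
    """Run a detector, then pass each detected region to a recognizer."""
    results = []
    for box in detect_text_regions(image):
        results.append((box, recognize_text(crop(image, box))))
    return results


# Toy stand-ins: a real system would call trained models here.
def fake_detector(image):
    return [(0, 2, 1, 4), (2, 1, 3, 5)]


def fake_recognizer(region):
    return "".join("".join(row) for row in region)


image = [list("..HI.."), list("......"), list(".STOP.")]
print(read_text(image, fake_detector, fake_recognizer))
# [((0, 2, 1, 4), 'HI'), ((2, 1, 3, 5), 'STOP')]
```

Keeping the detector and recognizer as separate arguments means either stage can be swapped for a different model without touching the rest of the pipeline.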

After reviewing the fundamental techniques in these papers, some areas to explore further are training and labeling (understanding loss functions, transfer learning, active learning, and knowledge distillation), adversarial networks (GANs and adversarial perturbations), and online models (those that run at real-time frame rates, especially ones such as SqueezeNet and MobileNet that can be deployed on mobile devices).