Last year we looked at some of the most important papers in contemporary computer vision. One of the sections in that post discussed object detection and classification, which typically involved rectangular bounding boxes around regions of interest. A more detailed type of output that networks can produce is a classification for every pixel in the image, known as a segmentation. This post describes recent work on panoptic segmentation and provides a reading list on the topic.

Panoptic segmentation was a popular topic at this year’s CVPR conference. The name “panoptic” refers to the fact that it combines semantic segmentation (classifying every pixel in the image) and instance segmentation (delineating distinct occurrences of objects within a class, such as individual cars, animals, or pedestrians). Prior to the work listed below, these two types of segmentation were typically pursued separately, with different kinds of networks: semantic segmenters used fully convolutional nets, while instance segmenters used region-based object proposals. Combining the two tasks into a single network that shares image features can significantly reduce computational effort and/or improve accuracy.
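
To make the combined output concrete, one common way to picture a panoptic result is as a pair of labels per pixel: a semantic class for every pixel, plus an instance id for pixels that belong to “things.” The toy sketch below is purely illustrative; the class ids, shapes, and layout are made up for this example and do not come from any of the papers.

```python
import numpy as np

H, W = 4, 6  # a tiny toy "image"

# Semantic segmentation: one class id per pixel (say 0 = road, 1 = sky, 2 = car).
semantic = np.zeros((H, W), dtype=np.int32)
semantic[0, :] = 1        # top row is sky
semantic[2:, 1:3] = 2     # one car
semantic[2:, 4:6] = 2     # another car

# Instance segmentation: an instance id per pixel, 0 where there is no instance.
instance = np.zeros((H, W), dtype=np.int32)
instance[2:, 1:3] = 1     # first car
instance[2:, 4:6] = 2     # second car

# A panoptic segmentation assigns both labels to every pixel: "stuff" pixels
# (road, sky) keep instance id 0, while "thing" pixels get distinct ids.
panoptic = np.stack([semantic, instance], axis=-1)
print(panoptic.shape)  # (4, 6, 2)
```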

Another set of terms that comes up often in this research is “stuff” and “things.” “Things” typically refers to the countable, discrete elements in the image that can be assigned instance identifiers: people, vehicles, etc. “Stuff” refers to regions of the image that do not have instance identifiers, such as regions of ground, sky, water, and road.
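
As a concrete illustration of this split, the COCO panoptic annotations mark each category with an isthing flag. The snippet below is a hand-written excerpt in that spirit; the specific ids and names are assumptions for illustration, not copied from the dataset files.

```python
# Hypothetical excerpt in the style of COCO's panoptic category list:
# isthing == 1 marks countable objects, isthing == 0 marks amorphous regions.
categories = [
    {"id": 1, "name": "person", "isthing": 1},
    {"id": 3, "name": "car", "isthing": 1},
    {"id": 100, "name": "road", "isthing": 0},
    {"id": 101, "name": "sky", "isthing": 0},
]

things = [c["name"] for c in categories if c["isthing"]]
stuff = [c["name"] for c in categories if not c["isthing"]]
print(things, stuff)  # ['person', 'car'] ['road', 'sky']
```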

If you have read the papers listed in the previous post, then the most important prior work to read in order to understand panoptic segmentation is the Mask R-CNN paper. Mask R-CNN extends Faster R-CNN by adding a branch that predicts a binary pixel mask for each detected region of interest, which, combined with the existing object detection branch, allows the network to perform instance segmentation.
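
If you want to experiment with Mask R-CNN directly, torchvision ships a pretrained implementation. The sketch below shows the rough shape of its inputs and outputs (newer torchvision versions use a weights= argument instead of pretrained=); treat it as a starting point rather than a canonical recipe.

```python
import torch
from torchvision.models.detection import maskrcnn_resnet50_fpn

# Mask R-CNN with a ResNet-50 FPN backbone, pretrained on COCO.
model = maskrcnn_resnet50_fpn(pretrained=True)
model.eval()

# The model takes a list of 3xHxW float tensors with values in [0, 1].
images = [torch.rand(3, 480, 640)]
with torch.no_grad():
    outputs = model(images)

# One dict per image: boxes, labels, and scores from the detection branch,
# plus a soft mask per detected instance from the added mask branch.
print(outputs[0].keys())          # boxes, labels, scores, masks
print(outputs[0]["masks"].shape)  # (num_detections, 1, 480, 640)
```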

Once you are familiar with Mask R-CNN you are ready to proceed with the papers below, listed in the recommended reading order:

  • Panoptic Segmentation: Unifying Semantic and Instance Segmentation These slides from COCO 2017 introduce the basic concepts behind panoptic segmentation as well as the panoptic quality (PQ) metric for assessing results (a short sketch of the metric follows this list).
  • Panoptic Segmentation This paper describes the same work as in the above slides, with more detail and additional visualizations. One important thing to note is that in this paper panoptic segmentation was still performed as a heuristic combination of two separate tasks, rather than in a single network.
  • Panoptic Feature Pyramid Networks Authored by the same group as the paper above, this article performs panoptic segmentation with a single network, achieved by adding a semantic segmentation branch to Mask R-CNN and using a shared Feature Pyramid Network as the “backbone” or “trunk” of the network.
  • FPN-based Network for Panoptic Segmentation A separate group of authors produced a very similar network to the one above, with similar results.
  • Learning to Fuse Things and Stuff The novel contribution of this paper is an objective function that encourages the instance and semantic segmentation outputs to be consistent with one another.
  • An End-to-End Network for Panoptic Segmentation This paper’s main contribution is an occlusion-aware model that replaces the heuristics previously used to resolve overlapping instances with a “spatial ranking module.”
  • Attention-guided Unified Network for Panoptic Segmentation The authors of this paper reframe the “stuff/things” distinction as “background/foreground” and use an attention module to connect features between the two branches.
  • UPSNet: A Unified Panoptic Segmentation Network UPSNet adds a “panoptic head” to the model that combines the instance and semantic outputs (using their mask logits) and adds the possibility of assigning pixels to an “unknown” class as a way to handle uncertainty and ambiguity.
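
To make the PQ metric mentioned in the first item concrete: predicted and ground-truth segments are matched (a match requires IoU greater than 0.5, which makes the matching unique), and PQ averages the IoU of the matched pairs while penalizing unmatched segments. The minimal sketch below assumes you have already computed the matched IoUs and the unmatched counts for one class; in the paper PQ is computed per class and then averaged, with special handling for void regions, which this sketch ignores.

```python
def panoptic_quality(matched_ious, num_false_positives, num_false_negatives):
    """PQ = (sum of IoUs over matched segments) / (|TP| + 0.5 |FP| + 0.5 |FN|)."""
    tp = len(matched_ious)
    denom = tp + 0.5 * num_false_positives + 0.5 * num_false_negatives
    if denom == 0:
        return 0.0
    pq = sum(matched_ious) / denom
    # PQ factors into segmentation quality (mean IoU of matched segments)
    # times recognition quality (an F1-like detection score).
    sq = sum(matched_ious) / tp if tp else 0.0
    rq = tp / denom
    assert abs(pq - sq * rq) < 1e-9
    return pq

# Three matched segments, one unmatched prediction, two missed ground truths.
print(panoptic_quality([0.9, 0.8, 0.75], num_false_positives=1, num_false_negatives=2))
```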

These papers constitute an exciting line of research that promises more results to come.