I recently presented a recap of my favorite presentations from ICML 2017. This post serves as a companion to my slides, which you can find here.
Junier B. Oliva, Barnabás Póczos, Jeff Schneider
- Problem: gated recurrent units require learning where we are in a sequence, when to open the gate, and the feature itself
- Innovation: Use exponentially weighted moving averages to preserve information about both temporality and scale
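To make the moving-average idea concrete, here is a minimal sketch in plain Python. It is a simplification rather than the authors' model: it averages raw scalar inputs instead of learned features, and the function name `multiscale_ewma` and the decay rates in `alphas` are illustrative assumptions.

```python
def multiscale_ewma(sequence, alphas=(0.0, 0.5, 0.9, 0.99)):
    """Return the final moving average of `sequence` at each decay rate.

    Small alphas track the most recent inputs; large alphas retain the
    distant past. Together they summarize the sequence at several scales.
    """
    averages = [0.0] * len(alphas)
    for x in sequence:
        # Standard EWMA update: mu <- alpha * mu + (1 - alpha) * x
        averages = [a * mu + (1.0 - a) * x for a, mu in zip(alphas, averages)]
    return averages
```

Feeding the same values in a different order (for example `[1.0, 0.0]` versus `[0.0, 1.0]`) yields different averages, which is how a bank of averages can carry information about position in the sequence without an explicit gate.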
Alexander Kolesnikov, Christoph H. Lampert
- Problem: The original PixelCNN model generates images with reasonable local structure, but they sometimes have unnatural global structure
- Innovation: Use latent variables to enforce global structure, then refine to obtain good local structure
- Approach 1: Grayscale images
- Generate grayscale images with global structure, then add color
- Approach 2: Downsampled images
- Generate coarse resolution images with global structure, then refine (this step is equivalent to superresolution models)
Scott Reed et al.
- Problem 1: PixelCNN samples exhibit asymmetry. Because pixel generation occurs in raster order, the top-left pixel has no history while the bottom-right pixel is (potentially) conditioned on every previous pixel.
- Problem 2: PixelCNN sampling can be slow
- Innovation: Generate portions of the image in parallel. This helps to balance the information asymmetry and also improves speed.
Ransalu Senanayake, Fabio Ramos
- Problem: We would like to mark every point in a space as occupied or unoccupied (or give it a probability of being occupied). Older methods rely either on grids with a fixed resolution or on Gaussian processes, which are slow and computationally expensive. In addition, the occupancy probabilities of adjacent points are correlated.
- Innovation: Use a Hilbert map on continuous space to assign probabilities.
Jonathan Binas, Daniel Neil, Shih-Chii Liu, Tobi Delbruck
- Problem: cameras have a fixed temporal resolution (the frame rate), high data rates (capture images of approximately equal byte size at every frame, even if highly redundant), motion blur, and limited dynamic range (e.g. difficulty adjusting to brightness changes).
- Innovation: Add a brightness change sensor that works in essentially real-time to improve temporal resolution and account for brightness changes (on log scale).
- See https://inilabs.com/products/dynamic-and-active-pixel-vision-sensor/
Learning to Scale
Minsik Cho, Daniel Brand
- Problem: Convolution is an expensive memory access pattern, but a relatively simple computation.
- Innovation: Use a BLAS-friendly memory layout to access memory more efficiently.
Esteban Real et al.
- Problem: There are orders of magnitude more problems that could benefit from deep learning than there are developers who can build models for them.
- Innovation: Use an evolutionary tournament to evolve models rather than architect them by hand.
- Results: Two metaparameters mattered: population size and number of training steps. Results improve monotonically with both, so use as much of each as your time and computational constraints allow.
- Application: This shows that it is possible to version and manage a large number of models. The authors used a filesystem structure with files that describe each model's traits (its parent, its performance, its architecture, etc.).
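The tournament idea can be sketched in a few lines. This is a toy illustration under stated assumptions, not the authors' system: individuals are single numbers rather than network architectures, and `toy_fitness` and `toy_mutate` are hypothetical stand-ins for "train the model and measure accuracy" and "alter the architecture".

```python
import random


def tournament_evolution(population, fitness, mutate, steps, seed=0):
    """Evolve `population` with repeated pairwise tournaments.

    Each step picks two individuals at random, discards the weaker one, and
    replaces it with a mutated copy of the stronger one.
    """
    rng = random.Random(seed)
    population = list(population)
    for _ in range(steps):
        i, j = rng.sample(range(len(population)), 2)
        if fitness(population[i]) < fitness(population[j]):
            i, j = j, i  # ensure index i holds the winner
        population[j] = mutate(population[i], rng)  # child replaces the loser
    return max(population, key=fitness)


# Hypothetical toy problem: fitness peaks when the individual equals 3.0.
def toy_fitness(x):
    return -(x - 3.0) ** 2


def toy_mutate(x, rng):
    return x + rng.gauss(0.0, 0.1)
```

Because the winner of each tournament always survives, the best fitness in the population never decreases. In a real setting the fitness of each individual would be computed once and stored (for instance in the per-model files described above) rather than recomputed at every tournament.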
Using Hardware Efficiently
Azalia Mirhoseini et al.
- Problem: Most assignment of computational tasks to devices in DNN training is done heuristically (e.g. wherever we have the most bandwidth or memory)
- Innovation: use reinforcement learning to learn a policy of how to assign tasks to resources, while optimizing an objective function (e.g. minimize training time)
- Results: the model learns to optimize resources and reduce waste, e.g. it won’t assign 4 GPUs and 1 CPU if 1 CPU can only keep up with 2 GPUs.
David Budden et al.
- Motivation: There exist many more CPUs than GPUs. Plus CPUs are cheaper up front. If you can get work done on CPUs, you don’t have to wait for the GPUs in your cluster to become available.
- Innovation: Treat convolution as polynomial multiplication. Perform a reduction jointly on the number of channels and number of features.
- Results: if your sparse matrix S has dimensions T * K, the speedup factor will be 2TK/(T+K)
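The link between convolution and polynomial multiplication can be checked directly. The sketch below only demonstrates that identity in one dimension; it does not implement the paper's fast algorithm or the joint reduction, and both function names are illustrative.

```python
def poly_multiply(a, b):
    """Multiply two polynomials given as coefficient lists (lowest degree first)."""
    out = [0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj  # x^i * x^j contributes to x^(i+j)
    return out


def full_convolution(signal, kernel):
    """Direct 1-D 'full' convolution: out[k] = sum_j signal[k - j] * kernel[j]."""
    out = []
    for k in range(len(signal) + len(kernel) - 1):
        total = 0
        for j, w in enumerate(kernel):
            i = k - j
            if 0 <= i < len(signal):
                total += signal[i] * w
        out.append(total)
    return out
```

For example, `(1 + 2x + 3x^2)(1 + x) = 1 + 3x + 5x^2 + 3x^3`, and both functions return `[1, 3, 5, 3]` for the inputs `[1, 2, 3]` and `[1, 1]`. Note that the "convolution" layers in most deep learning libraries compute cross-correlation, which is the same operation with the kernel reversed.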
Édouard Grave et al.
- Problem: Long-range RNNs are computationally expensive, especially when taking softmax on large vocabularies.
- Innovation: Use prior information about word frequency (follows a Zipf distribution) to divide the vocabulary into more/less frequent clusters, rather than hierarchical clustering by class.
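A rough sketch of the frequency-based split is shown below. It covers only the partitioning step, not the softmax itself; the function names, the example counts, and the cutoff are all hypothetical.

```python
def split_by_frequency(counts, cutoffs):
    """Partition a vocabulary into clusters ordered from most to least frequent.

    `counts` maps each word to its corpus frequency; `cutoffs` lists the rank
    boundaries between clusters (e.g. [2] puts the top 2 words in the head).
    """
    ranked = sorted(counts, key=counts.get, reverse=True)
    bounds = [0, *cutoffs, len(ranked)]
    return [ranked[lo:hi] for lo, hi in zip(bounds, bounds[1:])]


def cluster_mass(counts, clusters):
    """Fraction of all token occurrences that falls in each cluster."""
    total = sum(counts.values())
    return [sum(counts[w] for w in cluster) / total for cluster in clusters]
```

With Zipf-like counts, a small head cluster covers most token occurrences, so most predictions only need a softmax over the head; the larger and rarer tail clusters are evaluated far less often.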
Yunhe Wang et al.
- Motivation: DNNs can be compressed by 75% without substantial loss of performance
- Approach: maintain accuracy of the original network by preserving the distance between feature maps (i.e. learn a projection and remove redundant information in the feature maps)
Marc T. Law, Raquel Urtasun, Richard Zemel
- Goal: cluster similar examples, where similarity doesn’t necessarily mean belonging to the same class but could mean, e.g., visual similarity
- Approach: recast k-means as a gradient-based nonlinear regression problem by introducing Y, a matrix of cluster assignments, as the target
Active Learning for Vision
Yarin Gal, Riashat Islam, Zoubin Ghahramani
- Problem 1: Most deep learning models do not give true measures of uncertainty. For active learning, we would ideally like to query the most uncertain examples.
- Problem 2: Measures of uncertainty are computationally expensive to compute, so Bayesian methods have been unpopular in deep learning.
- Approach: Use dropout as a Bayesian approximation to estimate uncertainty.
- See http://mlg.eng.cam.ac.uk/yarin/blog_2248.html
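The dropout-based estimate can be sketched as follows. This is a deliberately tiny illustration, assuming a single linear layer with a scalar output and using the spread of the predictions as the uncertainty score; the function names are hypothetical.

```python
import random
import statistics


def dropout_forward(x, weights, drop_prob, rng):
    """One stochastic forward pass of a linear unit with dropout on its inputs."""
    keep = 1.0 - drop_prob
    total = 0.0
    for xi, wi in zip(x, weights):
        if rng.random() < keep:
            total += xi * wi / keep  # inverted-dropout scaling
    return total


def mc_dropout_predict(x, weights, drop_prob=0.5, passes=200, seed=0):
    """Keep dropout active at test time; return (mean, std) over many passes."""
    rng = random.Random(seed)
    samples = [dropout_forward(x, weights, drop_prob, rng) for _ in range(passes)]
    return statistics.mean(samples), statistics.pstdev(samples)


def most_uncertain(pool, weights, **kwargs):
    """Index of the pool example with the largest predictive spread."""
    spreads = [mc_dropout_predict(x, weights, **kwargs)[1] for x in pool]
    return max(range(len(pool)), key=spreads.__getitem__)
```

In an active learning loop, `most_uncertain` would select the next example to send to an annotator. For classification, an acquisition score computed from the averaged class probabilities (such as their entropy) would take the place of the standard deviation used here.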
Alex Graves et al.
- Motivation: What do we teach our DNN and when do we teach it?
- Approach: Sample tasks with probability 1-p(correct answer)
- Result: Uniform sampling works unreasonably well
- Note: This may not be true for vision, where we know that most DNNs learn similar features in their early layers (e.g. edge detectors)
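The sampling rule above can be written down in a few lines. This is a sketch of the rule as summarized here, not of the paper's full method; `p_correct` is assumed to hold the model's current accuracy on each task, and the function names are hypothetical.

```python
import random


def task_sampling_probs(p_correct):
    """Turn per-task accuracies into sampling probabilities proportional to 1 - p."""
    weights = [1.0 - p for p in p_correct]
    total = sum(weights)
    if total == 0.0:
        # Every task is already solved: fall back to uniform sampling.
        return [1.0 / len(p_correct)] * len(p_correct)
    return [w / total for w in weights]


def sample_task(p_correct, rng):
    """Draw the index of the next task to train on."""
    probs = task_sampling_probs(p_correct)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]
```

Tasks the model already solves are sampled rarely, while tasks it fails on are sampled often. The uniform baseline mentioned in the results corresponds to ignoring `p_correct` and giving every task the same probability.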
John Lipor, Laura Balzano
- Motivation: We would like to do active learning to place samples from the same distribution into the same subspace. For facial recognition, this amounts to asking annotators, “who is the person in this photo?” which is difficult/impossible to answer in practice.
- Approach: Instead, form pairs of images near the decision boundary and ask annotators, “are these two photos of the same person?”
David Bau et al.
- Motivation: We would like to have a better understanding of what our DNNs see.
- Approach 1: Look at the top activated images for each class, to see what the DNN thinks of as “ideal” members of a class
- Approach 2: Look at individual units across epochs. This is especially interesting when fine-tuning, because you can see what a unit “turns into”.
- See http://netdissect.csail.mit.edu/
Ge Liu, David Gifford
- Motivation: See which classes’ feature maps are correlated
- Approach 1: Invert neuron codes into image space or look at top activated inputs as a way to ask the question "which part of the input makes you say that?"
- Approach 2: Break the network into separate parts and interpret them separately (correlation matrix of feature maps)
Been Kim and Cynthia Rudin
- Motivation: Users prefer interpretable models
- Example 1: doctors would like to know why a model predicts that a patient may have sepsis
- Example 2: teachers would like to know why a model classifies a homework submission as high/low quality
- Approach: Use prototypes (real-world, representative examples) to show why a particular case was classified/clustered in a particular way
Tianlin (Tim) Shi et al.
- Motivation part 1: Virtual assistants provide large amounts of data when they perform a task, such as booking a flight for a client
- Motivation part 2: Lack of realistic datasets for reinforcement learning
- Innovation: Create a JS snippet that can capture mouse clicks, keypresses, etc. when a virtual assistant performs a task. Then, train a reinforcement learning agent that attempts to perform these actions in a way that generates the same POST request that a human’s browser generates.
Chris Donahue, Zachary C. Lipton, and Julian McAuley
- Problem: Choreography for the game Dance Dance Revolution is limited to a small number of songs, and is produced by human annotators.
- Approach: Learn (a) when to place dance steps and (b) which steps to place using deep learning.
- Results: Qualitatively interesting choreography; the model can reproduce a single-author corpus with 61% accuracy
- Applications: this is similar to the routing that a map application must produce, but easier/safer/quicker to test
Tolga Bolukbasi et al.
- Problem: Word2vec models trained on human-generated texts exhibit the same biases as their human authors (e.g. sexism, racism)
- Application: This shows that our ML models are garbage in, garbage out. Beyond optimizing our objective functions, our decisions and approaches have ethical implications. For example, self-driving vehicles have the potential to save many human lives but also carry huge risks if they misbehave.