Expectation-Maximization (EM) is one of those algorithms that leads to a genuine “ah-ha” moment once you understand it. While it can seem somewhat complicated at first its iterative nature makes it easy to visualize. This post will demonstrate expectation-maximization for a mixture of Gaussians in Python, using Matplotlib and Pandas.

The motivation for EM arises when you have a pair of problems whose solutions depend on one another. For example, suppose that you believe your data can be modeled as samples from a mixture of Gaussian distributions. If you knew which distribution each sample originated from you could estimate the parameters of the distribution. On the other hand, if you knew the parameters of all of the distributions you could estimate the probability that each data point originated from a given distribution. However, you don’t know either the assignments or the parameters, so how do you proceed?


Data points randomly generated from a known mixture of Gaussians

This is where EM comes in. Basically, you make initial guesses for both the assignments of points to distributions and their parameters, and then proceed iteratively. If you have priors over the hyperparameters you could use those and perform the experiment multiple times, or you could use random initial cluster assignments and relatively diffuse variances for the Gaussians. In this post I take the latter approach.

# initial guesses - intentionally bad
guess = { 'mu1': [1,1],
          'sig1': [ [1, 0], [0, 1] ],
          'mu2': [4,4],
          'sig2': [ [1, 0], [0, 1] ],
          'lambda': [0.4, 0.6]

Once you have initialized the assignments and parameters, you alternate between the expectation and maximization steps until your estimates converge (i.e. do not change much between iterations; for a mixture of Gaussians this is similar to convergence in k-means). In the expectation (E-)step, you assign each data point to the cluster from which it most likely originated. Then, in the maximization (M-)step you update the Gaussian parameters with maximum likelihood (or maximum a-posteriori) estimates.

Here is the E-step in Python:

# assign every data point to its most likely cluster
def expectation(dataFrame, parameters):
  for i in range(dataFrame.shape[0]):
    x = dataFrame['x'][i]
    y = dataFrame['y'][i]
    p_cluster1 = prob([x, y], list(parameters['mu1']), list(parameters['sig1']), parameters['lambda'][0] )
    p_cluster2 = prob([x, y], list(parameters['mu2']), list(parameters['sig2']), parameters['lambda'][1] )
    if p_cluster1 > p_cluster2:
      dataFrame['label'][i] = 1
      dataFrame['label'][i] = 2
  return dataFrame

And here is the M-step:

def maximization(dataFrame, parameters):
  points_assigned_to_cluster1 = dataFrame[dataFrame['label'] == 1]
  points_assigned_to_cluster2 = dataFrame[dataFrame['label'] == 2]
  percent_assigned_to_cluster1 = len(points_assigned_to_cluster1) / float(len(dataFrame))
  percent_assigned_to_cluster2 = 1 - percent_assigned_to_cluster1
  parameters['lambda'] = [percent_assigned_to_cluster1, percent_assigned_to_cluster2 ]
  parameters['mu1'] = [points_assigned_to_cluster1['x'].mean(), points_assigned_to_cluster1['y'].mean()]
  parameters['mu2'] = [points_assigned_to_cluster2['x'].mean(), points_assigned_to_cluster2['y'].mean()]
  parameters['sig1'] = [ [points_assigned_to_cluster1['x'].std(), 0 ], [ 0, points_assigned_to_cluster1['y'].std() ] ]
  parameters['sig2'] = [ [points_assigned_to_cluster2['x'].std(), 0 ], [ 0, points_assigned_to_cluster2['y'].std() ] ]
  return parameters

Note that this code could be made more generic to handle any number of distributions in the mixture. For this example I tried to err on the side of code clarity rather than flexibility.

The surprising part of EM, at least for me, is how quickly the estimates converge. In this example it took only six steps. As I mentioned earlier, this process is relatively easy to visualize.


Cluster assignments after three iterations of EM


Cluster assignments after six iterations of EM

As you can see, the “labels” are flipped here–points that were labeled as red above are blue here and vice-versa. That’s because the approach we took had no indication that “distribution 1” was the one with the lower mean or any such constraint. In practice this is not a problem because there is typically no requirement on the labels of the distributions; labels (such as “low X, low Y group”) can be assigned post-hoc if needed.

One issue that does arise with this particular problem is not knowing a priori how many distributions are in the mixture. Readers familiar with k-means will recognize this issue, and the solutions here are much the same: use an educated guess based on the problem context or data, or try multiple values of this hyperparameter and use a validation set to select this value. To form a prior over the number of distributions in the mixture you could also use the Indian Buffet Process.

This example and accompanying visualization should help you develop a better intuition for what expectation-maximization is, how it works, and when you might apply it. All code and graphics are available on Github. A similarly iterative and surprisingly effective algorithm that can be used for a variety of problems is A*, which I plan to discuss in an upcoming post. For more about expectation-maximization see chapter 7 of Simon Prince’s book–it is probably the single best explanation I have seen and includes examples of application within computer vision (and there is a free version online here).