Auto-Encoding Variational Bayes
Tutorial on Variational Autoencoders
- Training generative models is a long-standing problem, and existing approaches have three main drawbacks:
  - they may require strong assumptions about the structure of the data
  - they may make severe approximations, leading to suboptimal models
  - they may rely on computationally expensive inference procedures such as Markov chain Monte Carlo (MCMC)
In this work, the authors propose a stochastic variational inference and learning algorithm that performs efficient inference and learning in directed probabilistic models, in the presence of continuous latent variables with intractable posterior distributions and large datasets.
- Their contribution is two-fold:
  - They show that a reparameterization of the variational lower bound yields a lower bound estimator that can be straightforwardly optimized using standard stochastic gradient methods.
  - They show that for i.i.d. datasets with continuous latent variables per datapoint, posterior inference can be made especially efficient by fitting an approximate inference model to the intractable posterior using the proposed lower bound estimator.
When a neural network is used for the recognition model, we arrive at the variational auto-encoder.
Method
The strategy in this section can be used to derive a lower bound estimator (a stochastic objective function) for a variety of directed graphical models with continuous latent variables. We restrict ourselves here to the common case where we have an i.i.d. dataset with latent variables per datapoint, and where we would like to perform maximum likelihood (ML) or maximum a posteriori (MAP) inference on the (global) parameters, and variational inference on the latent variables.
Problem Scenario
- Let us consider some dataset \(\mathbf{X} = \{\mathbf{x}^{(i)}\}_{i=1}^N\) consisting of \(N\) i.i.d. samples. We assume that the data are generated by some random process, involving an unobserved continuous random variable \(\mathbf{z}\). This process consists of two steps:
  - a value \(\mathbf{z}^{(i)}\) is generated from some prior distribution \(p_{\mathbf{\theta}^*}(\mathbf{z})\)
  - a value \(\mathbf{x}^{(i)}\) is generated from some conditional distribution \(p_{\mathbf{\theta}^*}(\mathbf{x} \mid \mathbf{z})\)
We assume that the prior \(p_{\mathbf{\theta}^*}(\mathbf{z})\) and likelihood \(p_{\mathbf{\theta}^*}(\mathbf{x} \mid \mathbf{z})\) come from parametric families of distributions \(p_\mathbf{\theta}(\mathbf{z})\) and \(p_\mathbf{\theta}(\mathbf{x} \mid \mathbf{z})\), and that their PDFs are differentiable almost everywhere w.r.t. both \(\mathbf{\theta}\) and \(\mathbf{z}\). Much of this process is hidden from us: the true parameters \(\mathbf{\theta}^*\) and the latent values \(\mathbf{z}^{(i)}\) are unknown.
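The two-step generative process can be sketched in a few lines of Python. This is only an illustration under toy assumptions that are not part of the text: a scalar latent \(z\) with a fixed standard normal prior, a linear-Gaussian likelihood \(x \mid z \sim \mathcal{N}(wz + b, \sigma^2)\), and a hypothetical helper name `sample_dataset`.

```python
import random


def sample_dataset(n, theta, seed=0):
    """Draw n i.i.d. pairs (z, x) by ancestral sampling.

    Toy assumptions (not from the text): z is a scalar with prior N(0, 1),
    and x | z ~ N(w * z + b, sigma^2), where theta = (w, b, sigma).
    """
    rng = random.Random(seed)
    w, b, sigma = theta
    zs, xs = [], []
    for _ in range(n):
        z = rng.gauss(0.0, 1.0)          # step 1: z ~ p_theta(z)
        x = rng.gauss(w * z + b, sigma)  # step 2: x ~ p_theta(x | z)
        zs.append(z)
        xs.append(x)
    return zs, xs


# Hypothetical "true" parameters theta* used to generate the data.
theta_star = (2.0, 1.0, 0.5)
zs, xs = sample_dataset(1000, theta_star)
# In the problem scenario only xs is observed; zs and theta_star stay hidden.
```

In this sketch the learner would receive only `xs`; recovering `theta_star` and reasoning about the matching `zs` is the inference problem the rest of the section addresses.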
- The authors do not make the common simplifying assumptions about the marginal or posterior probabilities. Instead, they are interested in a general algorithm that works efficiently even in the case of:
  - Intractability: the marginal likelihood integral \(p_\mathbf{\theta}(\mathbf{x}) = \int p_\mathbf{\theta}(\mathbf{z})p_\mathbf{\theta}(\mathbf{x} \mid \mathbf{z})\,d\mathbf{z}\) is intractable
  - A large dataset: there is so much data that batch optimization is too costly
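To make the marginal likelihood integral concrete, the sketch below estimates \(p_\mathbf{\theta}(\mathbf{x})\) by naive Monte Carlo, averaging \(p_\mathbf{\theta}(\mathbf{x} \mid \mathbf{z})\) over samples from the prior. It assumes a toy scalar linear-Gaussian model (\(z \sim \mathcal{N}(0, 1)\), \(x \mid z \sim \mathcal{N}(wz + b, \sigma^2)\)), which is not taken from the text. That model was chosen because its marginal \(\mathcal{N}(b, w^2 + \sigma^2)\) is available in closed form and can be used to check the estimate. It is exactly the kind of tractable special case the authors do not want to rely on, and the function names are hypothetical.

```python
import math
import random


def normal_pdf(x, mean, std):
    """Density of N(mean, std^2) evaluated at x."""
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2.0 * math.pi))


def marginal_likelihood_mc(x, theta, num_samples=10000, seed=0):
    """Naive Monte Carlo estimate of p(x) = E_{z ~ p(z)}[p(x | z)]."""
    rng = random.Random(seed)
    w, b, sigma = theta
    total = 0.0
    for _ in range(num_samples):
        z = rng.gauss(0.0, 1.0)                   # z ~ p(z), toy prior N(0, 1)
        total += normal_pdf(x, w * z + b, sigma)  # p(x | z), toy likelihood
    return total / num_samples


def marginal_likelihood_exact(x, theta):
    """Closed-form marginal, available only because the toy model is linear-Gaussian."""
    w, b, sigma = theta
    return normal_pdf(x, b, math.sqrt(w * w + sigma * sigma))


theta = (2.0, 1.0, 0.5)  # hypothetical parameter values (w, b, sigma)
estimate = marginal_likelihood_mc(1.5, theta)
exact = marginal_likelihood_exact(1.5, theta)
print(f"Monte Carlo: {estimate:.4f}  exact: {exact:.4f}")
```

With one latent dimension the naive average is adequate, but it has to be recomputed for every datapoint and every parameter update, and with a high-dimensional \(\mathbf{z}\) and a neural-network likelihood it generally needs far too many samples to be practical.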