So far, I’ve written about two types of generative models, GAN and VAE. Neither of them explicitly learns the probability density function of real data, $p(\mathbf{x})$, because that density is generally intractable: in a latent-variable model, for instance, computing $p(\mathbf{x}) = \int p(\mathbf{x}\mid\mathbf{z})p(\mathbf{z})\,d\mathbf{z}$ requires integrating over all possible latent codes $\mathbf{z}$.
Flow-based deep generative models conquer this hard problem with the help of normalizing flows, a powerful statistical tool for density estimation. A good estimate of $p(\mathbf{x})$ makes many downstream tasks possible, such as sampling new realistic data points, scoring how likely an observation is, and filling in incomplete data.
Types of Generative Models
Here is a quick summary of the difference between GAN, VAE, and flow-based generative models:
- Generative adversarial networks: GAN provides a smart solution to model the data generation, an unsupervised learning problem, as a supervised one. The discriminator model learns to distinguish the real data from the fake samples that are produced by the generator model. The two models are trained as if they are playing a minimax game.
- Variational autoencoders: VAE implicitly optimizes the log-likelihood of the data by maximizing the evidence lower bound (ELBO).
- Flow-based generative models: A flow-based generative model is constructed by a sequence of invertible transformations. Unlike the other two, the model explicitly learns the data distribution $p(\mathbf{x})$, and therefore the loss function is simply the negative log-likelihood.

Linear Algebra Basics Recap
We should understand two key concepts before getting into the flow-based generative model: the Jacobian determinant and the change of variable rule. Pretty basic, so feel free to skip.
Jacobian Matrix and Determinant
Given a function $f: \mathbb{R}^n \mapsto \mathbb{R}^m$ that maps an $n$-dimensional input vector $\mathbf{x}$ to an $m$-dimensional output vector, the matrix of all first-order partial derivatives of this function is called the Jacobian matrix $\mathbf{J}$, where the entry on the $i$-th row and $j$-th column is $\mathbf{J}_{ij} = \frac{\partial f_i}{\partial x_j}$:

$$\mathbf{J} = \begin{bmatrix} \frac{\partial f_1}{\partial x_1} & \dots & \frac{\partial f_1}{\partial x_n} \\ \vdots & \ddots & \vdots \\ \frac{\partial f_m}{\partial x_1} & \dots & \frac{\partial f_m}{\partial x_n} \end{bmatrix}$$
The determinant is a real number computed from all the elements of a square matrix. Note that the determinant only exists for square matrices. The absolute value of the determinant can be thought of as a measure of “how much multiplication by the matrix expands or contracts space”.
The determinant of an $n \times n$ matrix $M$ is:

$$\det M = \det \begin{bmatrix} a_{11} & a_{12} & \dots & a_{1n} \\ a_{21} & a_{22} & \dots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{n1} & a_{n2} & \dots & a_{nn} \end{bmatrix} = \sum_{j_1 j_2 \dots j_n} (-1)^{\tau(j_1 j_2 \dots j_n)} a_{1 j_1} a_{2 j_2} \dots a_{n j_n}$$

where the subscript under the summation, $j_1 j_2 \dots j_n$, ranges over all permutations of the set $\{1, 2, \dots, n\}$, so there are $n!$ terms in total, and $\tau(\cdot)$ denotes the signature of a permutation.
The determinant of a square matrix $M$ detects whether the matrix is invertible: if $\det M = 0$, then $M$ is not invertible (a singular matrix); otherwise it is invertible (non-singular).
The determinant of a product is equal to the product of the determinants: $\det(AB) = \det(A)\det(B)$.
Change of Variable Theorem
Let’s review the change of variable theorem specifically in the context of probability density estimation, starting with a single variable case.
Given a random variable $z$ with a known probability density function, $z \sim \pi(z)$, we would like to construct a new random variable through a one-to-one mapping $x = f(z)$. The function $f$ is invertible, so $z = f^{-1}(x)$. The question is how to infer the unknown density of the new variable, $p(x)$:

$$\int p(x)\,dx = \int \pi(z)\,dz = 1 \quad\text{(by definition of a probability distribution)}$$

$$p(x) = \pi(z)\left|\frac{dz}{dx}\right| = \pi(f^{-1}(x))\left|\frac{d f^{-1}}{dx}\right| = \pi(f^{-1}(x))\,\big|(f^{-1})'(x)\big|$$

By definition, the integral $\int \pi(z)\,dz$ is the sum of infinitely many rectangles of infinitesimal width $\Delta z$, where the height of the rectangle at position $z$ is the density value $\pi(z)$. Substituting the variable $z = f^{-1}(x)$ yields $\Delta z = (f^{-1}(x))'\Delta x$, so $|(f^{-1}(x))'|$ is the ratio between the widths of corresponding rectangles in the $z$ and $x$ coordinates, which is exactly the factor by which the density must be rescaled.
The multivariable version has a similar form:

$$\mathbf{z} \sim \pi(\mathbf{z}), \quad \mathbf{x} = f(\mathbf{z}), \quad \mathbf{z} = f^{-1}(\mathbf{x})$$

$$p(\mathbf{x}) = \pi(\mathbf{z})\left|\det\frac{d\mathbf{z}}{d\mathbf{x}}\right| = \pi(f^{-1}(\mathbf{x}))\left|\det\frac{d f^{-1}}{d\mathbf{x}}\right|$$

where $\det\frac{\partial f}{\partial\mathbf{z}}$ is the Jacobian determinant of the function $f$.
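As a quick sanity check of the multivariable rule, here is a minimal numpy/scipy sketch (my own example, not from any paper): it pushes a standard Gaussian through an invertible linear map and compares the change-of-variables density against the known closed form.

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)

# Invertible linear map x = f(z) = A z with a fixed, well-conditioned A.
A = np.array([[2.0, 0.5],
              [0.3, 1.5]])
A_inv = np.linalg.inv(A)

base = multivariate_normal(mean=np.zeros(2), cov=np.eye(2))  # pi(z) = N(0, I)

x = rng.normal(size=2)   # an arbitrary query point
z = A_inv @ x            # z = f^{-1}(x)

# Change of variables: p(x) = pi(f^{-1}(x)) * |det d f^{-1} / dx|
p_x_flow = base.pdf(z) * np.abs(np.linalg.det(A_inv))

# Closed form: x = A z with z ~ N(0, I) implies x ~ N(0, A A^T)
p_x_exact = multivariate_normal(mean=np.zeros(2), cov=A @ A.T).pdf(x)

assert np.allclose(p_x_flow, p_x_exact)
```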
What are Normalizing Flows?
Being able to do good density estimation has direct applications in many machine learning problems, but it is very hard. For example, since we need to run backpropagation in deep learning models, the embedded probability distribution (i.e., the posterior $p(\mathbf{z}\mid\mathbf{x})$) is expected to be simple enough to compute derivatives easily and efficiently. That is why the Gaussian distribution is so often used in latent-variable generative models, even though most real-world distributions are much more complicated than a Gaussian.
Here comes a Normalizing Flow (NF) model for better and more powerful distribution approximation. A normalizing flow transforms a simple distribution into a complex one by applying a sequence of invertible transformation functions. Flowing through a chain of transformations, we repeatedly substitute the variable for the new one according to the change of variables theorem and eventually obtain a probability distribution of the final target variable.

As defined in Fig. 2, we have $\mathbf{z}_{i-1} \sim p_{i-1}(\mathbf{z}_{i-1})$ and $\mathbf{z}_i = f_i(\mathbf{z}_{i-1})$, thus $\mathbf{z}_{i-1} = f_i^{-1}(\mathbf{z}_i)$. By the change of variable theorem:

$$p_i(\mathbf{z}_i) = p_{i-1}(f_i^{-1}(\mathbf{z}_i))\left|\det\frac{d f_i^{-1}}{d\mathbf{z}_i}\right|$$

Then let’s rewrite the equation as a function of $\mathbf{z}_i$ so that we can do inference with the base distribution. Using the inverse function theorem and the property of Jacobians of invertible functions (see the notes below):

$$p_i(\mathbf{z}_i) = p_{i-1}(\mathbf{z}_{i-1})\left|\det\frac{d f_i}{d\mathbf{z}_{i-1}}\right|^{-1} \quad\Rightarrow\quad \log p_i(\mathbf{z}_i) = \log p_{i-1}(\mathbf{z}_{i-1}) - \log\left|\det\frac{d f_i}{d\mathbf{z}_{i-1}}\right|$$
(*) A note on the “inverse function theorem”: If $y = f(x)$ and $x = f^{-1}(y)$, then $\frac{d f^{-1}(y)}{dy} = \frac{dx}{dy} = \left(\frac{dy}{dx}\right)^{-1} = \left(\frac{d f(x)}{dx}\right)^{-1}$.
(*) A note on “Jacobians of invertible functions”: The determinant of the inverse of an invertible matrix is the inverse of the determinant, $\det(M^{-1}) = (\det M)^{-1}$, because $\det(M)\det(M^{-1}) = \det(M M^{-1}) = \det I = 1$.
Given such a chain of probability density functions, we know the relationship between each pair of consecutive variables. We can expand the equation of the output $\mathbf{x}$ step by step until tracing back to the initial distribution $\mathbf{z}_0$:

$$\mathbf{x} = \mathbf{z}_K = f_K \circ f_{K-1} \circ \dots \circ f_1(\mathbf{z}_0)$$

$$\log p(\mathbf{x}) = \log \pi_K(\mathbf{z}_K) = \log \pi_0(\mathbf{z}_0) - \sum_{i=1}^K \log\left|\det\frac{d f_i}{d\mathbf{z}_{i-1}}\right|$$
The path traversed by the random variables $\mathbf{z}_i = f_i(\mathbf{z}_{i-1})$ is the flow, and the full chain formed by the successive distributions $\pi_i$ is called a normalizing flow. For the computation above to be tractable, a transformation function $f_i$ should satisfy two properties (a minimal code sketch of such a chain follows the list below):
- It is easily invertible.
- Its Jacobian determinant is easy to compute.
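To make the log-likelihood bookkeeping concrete, here is a minimal PyTorch-style sketch (my own illustration, not code from any of the papers cited here). It assumes each transform exposes a `forward` and an `inverse` method that return the transformed variable together with the log-determinant of the corresponding Jacobian.

```python
import torch
import torch.nn as nn

class NormalizingFlow(nn.Module):
    """A chain of invertible transforms z_0 -> z_1 -> ... -> z_K = x.

    Each transform t is assumed to implement
      t.forward(z) -> (z_next, log_det)  with log_det = log|det d f / d z|,
      t.inverse(x) -> (z_prev, log_det)  with log_det = log|det d f^{-1} / d x|.
    """
    def __init__(self, transforms, base_dist):
        super().__init__()
        self.transforms = nn.ModuleList(transforms)
        # base_dist: e.g. torch.distributions.MultivariateNormal(torch.zeros(D), torch.eye(D))
        self.base_dist = base_dist

    def log_prob(self, x):
        # log p(x) = log pi_0(z_0) + sum_i log|det d f_i^{-1} / d z_i|
        #          = log pi_0(z_0) - sum_i log|det d f_i / d z_{i-1}|
        total_log_det = torch.zeros(x.shape[0], device=x.device)
        z = x
        for t in reversed(list(self.transforms)):
            z, log_det = t.inverse(z)
            total_log_det = total_log_det + log_det
        return self.base_dist.log_prob(z) + total_log_det

    def sample(self, num_samples):
        z = self.base_dist.sample((num_samples,))
        for t in self.transforms:
            z, _ = t.forward(z)
        return z
```

Training then simply minimizes `-flow.log_prob(x).mean()` over the dataset, which is the negative log-likelihood discussed next.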
Models with Normalizing Flows
With normalizing flows in our toolbox, the exact log-likelihood of input data, $\log p(\mathbf{x})$, becomes tractable. As a result, the training criterion of a flow-based generative model is simply the negative log-likelihood over the training dataset $\mathcal{D}$:

$$\mathcal{L}(\mathcal{D}) = -\frac{1}{|\mathcal{D}|}\sum_{\mathbf{x}\in\mathcal{D}} \log p(\mathbf{x})$$
RealNVP
The RealNVP (Real-valued Non-Volume Preserving; Dinh et al., 2017) model implements a normalizing flow by stacking a sequence of invertible bijective transformation functions. In each bijection $f: \mathbf{x} \mapsto \mathbf{y}$, known as an affine coupling layer, the $D$ input dimensions are split into two parts:
- The first $d$ dimensions stay the same;
- The second part, dimensions $d+1$ to $D$, undergoes an affine transformation (“scale-and-shift”), and both the scale and shift parameters are functions of the first $d$ dimensions.

$$\begin{cases}\mathbf{y}_{1:d} = \mathbf{x}_{1:d} \\ \mathbf{y}_{d+1:D} = \mathbf{x}_{d+1:D} \odot \exp\big(s(\mathbf{x}_{1:d})\big) + t(\mathbf{x}_{1:d})\end{cases}$$

where $s(\cdot)$ and $t(\cdot)$ are the scale and translation functions, both mapping $\mathbb{R}^d \mapsto \mathbb{R}^{D-d}$, and $\odot$ denotes the element-wise product.
Now let’s check whether this transformation satisfies the two basic properties required of a flow transformation.
Condition 1: “It is easily invertible.”
Yes, and it is fairly straightforward: the inverse is available in closed form without inverting $s$ or $t$:

$$\begin{cases}\mathbf{x}_{1:d} = \mathbf{y}_{1:d} \\ \mathbf{x}_{d+1:D} = \big(\mathbf{y}_{d+1:D} - t(\mathbf{y}_{1:d})\big) \odot \exp\big(-s(\mathbf{y}_{1:d})\big)\end{cases}$$
Condition 2: “Its Jacobian determinant is easy to compute.”
Yes. It is not hard to get the Jacobian matrix of this transformation, which is lower triangular:

$$\mathbf{J} = \begin{bmatrix}\mathbb{I}_d & \mathbf{0}_{d\times(D-d)} \\ \frac{\partial \mathbf{y}_{d+1:D}}{\partial \mathbf{x}_{1:d}} & \text{diag}\big(\exp(s(\mathbf{x}_{1:d}))\big)\end{bmatrix}$$

Hence the determinant is simply the product of the terms on the diagonal:

$$\det(\mathbf{J}) = \prod_{j=1}^{D-d}\exp\big(s(\mathbf{x}_{1:d})\big)_j = \exp\Big(\sum_{j=1}^{D-d} s(\mathbf{x}_{1:d})_j\Big)$$
So far, the affine coupling layer looks perfect for constructing a normalizing flow :)
Even better, since (i) computing $f^{-1}$ does not require computing the inverse of $s$ or $t$ and (ii) computing the Jacobian determinant does not involve computing the Jacobian of $s$ or $t$, those functions can be arbitrarily complex; both $s$ and $t$ can be modeled by deep neural networks.
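Here is a minimal PyTorch sketch of one affine coupling layer (my own illustration; the hidden width, the two-layer MLPs for $s$ and $t$, and the tanh on the scale output are common but arbitrary choices, not the paper's exact architecture):

```python
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    """y_{1:d} = x_{1:d};  y_{d+1:D} = x_{d+1:D} * exp(s(x_{1:d})) + t(x_{1:d})."""
    def __init__(self, dim, d, hidden=64):
        super().__init__()
        self.d = d
        # tanh on the scale output is a common stabilization trick
        self.scale_net = nn.Sequential(
            nn.Linear(d, hidden), nn.ReLU(), nn.Linear(hidden, dim - d), nn.Tanh())
        self.shift_net = nn.Sequential(
            nn.Linear(d, hidden), nn.ReLU(), nn.Linear(hidden, dim - d))

    def forward(self, x):
        x1, x2 = x[:, :self.d], x[:, self.d:]
        s, t = self.scale_net(x1), self.shift_net(x1)
        y2 = x2 * torch.exp(s) + t
        log_det = s.sum(dim=-1)        # log|det J| = sum_j s_j(x_{1:d})
        return torch.cat([x1, y2], dim=-1), log_det

    def inverse(self, y):
        y1, y2 = y[:, :self.d], y[:, self.d:]
        s, t = self.scale_net(y1), self.shift_net(y1)
        x2 = (y2 - t) * torch.exp(-s)
        log_det = -s.sum(dim=-1)       # Jacobian of the inverse map
        return torch.cat([y1, x2], dim=-1), log_det
```

A layer like this plugs directly into the flow chain sketched earlier; alternating which half of the dimensions is conditioned on (as described next) is what makes the full stack expressive.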
In one affine coupling layer, some dimensions (channels) remain unchanged. To make sure all the inputs have a chance to be altered, the model reverses the ordering in each layer so that different components are left unchanged. Following such an alternating pattern, the set of units which remain identical in one transformation layer are always modified in the next. Batch normalization is found to help training models with a very deep stack of coupling layers.
Furthermore, RealNVP can work in a multi-scale architecture to build a more efficient model for large inputs. The multi-scale architecture applies several “sampling” operations to normal affine layers, including spatial checkerboard pattern masking, squeezing operation, and channel-wise masking. Read the paper for more details on the multi-scale architecture.
NICE
The NICE (Non-linear Independent Component Estimation; Dinh, et al. 2015) model is a predecessor of RealNVP. The transformation in NICE is the affine coupling layer without the scale term, known as additive coupling layer.
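In the notation used above for RealNVP, the additive coupling layer reads:

$$\begin{cases}\mathbf{y}_{1:d} = \mathbf{x}_{1:d} \\ \mathbf{y}_{d+1:D} = \mathbf{x}_{d+1:D} + m(\mathbf{x}_{1:d})\end{cases}
\quad\Leftrightarrow\quad
\begin{cases}\mathbf{x}_{1:d} = \mathbf{y}_{1:d} \\ \mathbf{x}_{d+1:D} = \mathbf{y}_{d+1:D} - m(\mathbf{y}_{1:d})\end{cases}$$

Its Jacobian is unit-triangular, so the determinant is 1 and the transformation is volume-preserving, which is exactly the contrast behind RealNVP’s “non-volume preserving” name.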
Glow
The Glow (Kingma and Dhariwal, 2018) model extends the previous reversible generative models, NICE and RealNVP, and simplifies the architecture by replacing the reverse permutation operation on the channel ordering with invertible 1x1 convolutions.

There are three substeps in one step of flow in Glow.
Substep 1: Activation normalization (short for “actnorm”)
It performs an affine transformation using a scale and bias parameter per channel, similar to batch normalization, but works for mini-batch size 1. The parameters are trainable but initialized so that the first minibatch of data has mean 0 and standard deviation 1 after actnorm.
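A rough PyTorch sketch of actnorm with data-dependent initialization (my own simplification over 2-D feature matrices rather than image tensors; the per-channel scale and bias follow the description above):

```python
import torch
import torch.nn as nn

class ActNorm(nn.Module):
    """Per-channel affine y = scale * x + bias, initialized from the first batch
    so that the output has roughly zero mean and unit variance per channel."""
    def __init__(self, num_channels):
        super().__init__()
        self.scale = nn.Parameter(torch.ones(num_channels))
        self.bias = nn.Parameter(torch.zeros(num_channels))
        self.initialized = False  # plain flag, not persisted; fine for a sketch

    def forward(self, x):                      # x: (batch, num_channels)
        if not self.initialized:
            with torch.no_grad():
                std = x.std(dim=0) + 1e-6
                self.scale.copy_(1.0 / std)
                self.bias.copy_(-x.mean(dim=0) / std)
            self.initialized = True
        y = self.scale * x + self.bias
        # log|det J| is the same for every sample in the batch
        log_det = torch.log(torch.abs(self.scale)).sum() * torch.ones(x.shape[0], device=x.device)
        return y, log_det
```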
Substep 2: Invertible 1x1 conv
Between layers of the RealNVP flow, the ordering of channels is reversed so that all the data dimensions have a chance to be altered. A 1×1 convolution with an equal number of input and output channels is a generalization of any permutation of the channel ordering.
Say we have an invertible 1x1 convolution applied to an input tensor $\mathbf{h}$ of size $h \times w \times c$ with a weight matrix $\mathbf{W}$ of size $c \times c$; the output is also an $h \times w \times c$ tensor, labeled $f = \texttt{conv2d}(\mathbf{h}; \mathbf{W})$.
In order to apply the change of variable rule, note that every spatial position of the input is multiplied by the same $c \times c$ matrix $\mathbf{W}$, so the log-determinant of the whole transformation factorizes as $h \cdot w \cdot \log|\det(\mathbf{W})|$.
The inverse 1x1 convolution depends on the inverse matrix $\mathbf{W}^{-1}$. Since the weight matrix is relatively small ($c \times c$), the cost of computing its determinant and inverse stays under control.
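A small sketch of this bookkeeping in PyTorch (illustrative only; practical implementations often parameterize $\mathbf{W}$ via an LU decomposition to make the determinant even cheaper):

```python
import torch

def invertible_1x1_logdet(W, height, width):
    """Log|det| contribution of applying the same c x c matrix W at every
    spatial position of an h x w x c feature map: h * w * log|det W|."""
    sign, logabsdet = torch.linalg.slogdet(W)
    return height * width * logabsdet

def invertible_1x1_inverse_weight(W):
    """Weight used for the reverse direction of the 1x1 convolution."""
    return torch.linalg.inv(W)
```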
Substep 3: Affine coupling layer
The design is the same as in RealNVP.

Models with Autoregressive Flows
The autoregressive constraint is a way to model sequential data, $\mathbf{x} = [x_1, \dots, x_D]$: each output depends only on the data observed in the past, not on the future. In other words, the probability of observing $x_i$ is conditioned on $x_1, \dots, x_{i-1}$, and the product of these conditional probabilities gives us the probability of observing the full sequence:

$$p(\mathbf{x}) = \prod_{i=1}^{D} p(x_i \mid x_1, \dots, x_{i-1}) = \prod_{i=1}^{D} p(x_i \mid \mathbf{x}_{1:i-1})$$
How to model the conditional density is up to you. It can be a univariate Gaussian with mean and standard deviation computed as functions of $\mathbf{x}_{1:i-1}$, or a multilayer neural network that takes $\mathbf{x}_{1:i-1}$ as input.
If a flow transformation in a normalizing flow is framed as an autoregressive model — each dimension in a vector variable is conditioned on the previous dimensions — this is an autoregressive flow.
This section starts with several classic autoregressive models (MADE, PixelRNN, WaveNet) and then we dive into autoregressive flow models (MAF and IAF).
MADE
MADE (Masked Autoencoder for Distribution Estimation; Germain et al., 2015) is a specially designed architecture to enforce the autoregressive property in an autoencoder efficiently. When using an autoencoder to predict the conditional probabilities, rather than feeding the autoencoder with inputs of $D$ different observation windows, MADE removes the contribution of certain hidden units by multiplying the weight matrices with binary masks, so that each output dimension depends only on the dimensions that precede it in a given ordering and all conditionals are produced in a single pass.
In a multilayer fully-connected neural network, say we have $L$ hidden layers with weight matrices $\mathbf{W}^1, \dots, \mathbf{W}^L$ and an output layer with weight matrix $\mathbf{V}$. The output $\hat{\mathbf{x}}$ has each dimension $\hat{x}_i = p(x_i \mid \mathbf{x}_{1:i-1})$.
Without any mask, the computation through the layers looks like the following:

$$\mathbf{h}^0 = \mathbf{x},\qquad \mathbf{h}^l = \text{activation}^l(\mathbf{W}^l \mathbf{h}^{l-1} + \mathbf{b}^l),\qquad \hat{\mathbf{x}} = \sigma(\mathbf{V}\mathbf{h}^L + \mathbf{c})$$

To zero out some connections between layers, we can simply element-wise multiply every weight matrix by a binary mask matrix. Each hidden node is assigned a random “connectivity integer” between $1$ and $D-1$; the assigned value for the $k$-th unit in the $l$-th layer is denoted by $m_k^l$.
A unit in the current layer can only be connected to units with equal or smaller connectivity integers in the previous layer, and this type of dependency easily propagates through the network up to the output layer. Once the integers are assigned to all the units and layers, the ordering of input dimensions is fixed and the conditional probabilities are produced with respect to it. See a great illustration in Fig. 5. To make sure every hidden unit is connected to the input and output layers through some path, $m_k^l$ is sampled to be equal to or greater than the minimal connectivity integer in the previous layer, $\min_{k'} m_{k'}^{l-1}$.
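A minimal numpy sketch of how such mask matrices can be built from the connectivity integers (my own illustration of the rule described above, for a single hidden layer and the natural input ordering):

```python
import numpy as np

def made_masks(D, hidden_size, seed=0):
    """Binary masks enforcing the autoregressive property for an
    input -> hidden -> output MADE-style autoencoder."""
    rng = np.random.default_rng(seed)
    m_in = np.arange(1, D + 1)                       # input units numbered 1..D
    m_hidden = rng.integers(1, D, size=hidden_size)  # connectivity integers in [1, D-1]
    m_out = np.arange(1, D + 1)                      # output i predicts p(x_i | x_{<i})

    # Hidden unit k may see input j only if m_hidden[k] >= m_in[j]
    mask_in = (m_hidden[:, None] >= m_in[None, :]).astype(np.float32)   # (H, D)
    # Output unit i may see hidden unit k only if m_out[i] > m_hidden[k] (strict)
    mask_out = (m_out[:, None] > m_hidden[None, :]).astype(np.float32)  # (D, H)
    return mask_in, mask_out
```

Each mask is element-wise multiplied with the corresponding weight matrix before the forward pass; resampling `m_hidden` per minibatch gives the connectivity-agnostic training mentioned below.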
MADE training can be further facilitated by:
- Order-agnostic training: shuffle the input dimensions, so that MADE is able to model any arbitrary ordering; can create an ensemble of autoregressive models at the runtime.
- Connectivity-agnostic training: to avoid the model being tied to one specific connectivity pattern, resample the connectivity integers $m_k^l$ for each training minibatch.
PixelRNN
PixelRNN (Oord et al, 2016) is a deep generative model for images. The image is generated one pixel at a time and each new pixel is sampled conditional on the pixels that have been seen before.
Let’s consider an image of size $n \times n$, $\mathbf{x} = \{x_1, \dots, x_{n^2}\}$. The model starts generating pixels from the top left corner, moving from left to right and top to bottom (see Fig. 6).

Every pixel $x_i$ is sampled from a probability distribution conditioned on the past context: the pixels above it, plus the pixels to its left in the same row. The definition of such a context may look arbitrary, since visual attention over an image can be far more flexible, yet a generative model with such a strong assumption works surprisingly well.
One implementation that could capture the entire context is the Diagonal BiLSTM. First, apply the skewing operation by offsetting each row of the input feature map by one position with respect to the previous row, so that computation for each row can be parallelized. Then the LSTM states are computed with respect to the current pixel and the pixels on the left.

$$[\mathbf{o}_i, \mathbf{f}_i, \mathbf{i}_i, \mathbf{g}_i] = \sigma(\mathbf{K}^{ss} \circledast \mathbf{h}_{i-1} + \mathbf{K}^{is} \circledast \mathbf{x}_i)$$

$$\mathbf{c}_i = \mathbf{f}_i \odot \mathbf{c}_{i-1} + \mathbf{i}_i \odot \mathbf{g}_i, \qquad \mathbf{h}_i = \mathbf{o}_i \odot \tanh(\mathbf{c}_i)$$

where $\circledast$ denotes the convolution operation and $\odot$ is the element-wise product; $\sigma$ applies a sigmoid to the gates and a $\tanh$ to the content $\mathbf{g}_i$. The input-to-state component $\mathbf{K}^{is}$ is a 1x1 convolution, while the state-to-state recurrent component $\mathbf{K}^{ss}$ is a column-wise convolution with a kernel of size 2x1.
The diagonal BiLSTM layers are capable of processing an unbounded context field, but expensive to compute due to the sequential dependency between states. A faster implementation uses multiple convolutional layers without pooling to define a bounded context box. The convolution kernel is masked so that the future context is not seen, similar to MADE. This convolution version is called PixelCNN.
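A minimal sketch of the masked-convolution idea in PyTorch (my own illustration with a “type A”-style mask that also hides the current pixel; the real PixelCNN additionally distinguishes mask types and handles color channels):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskedConv2d(nn.Conv2d):
    """Conv2d whose kernel is zeroed at the current pixel and everything after it
    in raster-scan order, so the output at (i, j) only sees past context."""
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        _, _, kh, kw = self.weight.shape
        mask = torch.ones(kh, kw)
        mask[kh // 2, kw // 2:] = 0.0   # current pixel and pixels to its right
        mask[kh // 2 + 1:, :] = 0.0     # all rows below
        self.register_buffer("mask", mask)

    def forward(self, x):
        return F.conv2d(x, self.weight * self.mask, self.bias,
                        self.stride, self.padding, self.dilation, self.groups)

# e.g. conv = MaskedConv2d(1, 16, kernel_size=3, padding=1)
```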

WaveNet
WaveNet (Van Den Oord, et al. 2016) is very similar to PixelCNN but applied to 1-D audio signals. WaveNet consists of a stack of causal convolutions, a convolution operation designed to respect the ordering: the prediction at a certain timestamp can only consume data observed in the past, with no dependency on the future. In PixelCNN, the causal convolution is implemented by a masked convolution kernel. In WaveNet, the causal convolution simply shifts the output by a number of timestamps into the future so that the output is aligned with the last input element.
One big drawback of causal convolution layers is their very limited receptive field: the output can hardly depend on inputs hundreds or thousands of timesteps ago, which can be a crucial requirement for modeling long sequences. WaveNet therefore adopts dilated convolutions, where the kernel is applied to an evenly spaced subset of samples in a much larger receptive field of the input.
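A sketch of a causal, dilated 1-D convolution in PyTorch (my own illustration: causality is enforced by left-padding the input by `(kernel_size - 1) * dilation`, and the dilation-doubling schedule below is a typical choice rather than a requirement):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalDilatedConv1d(nn.Module):
    """1-D convolution whose output at time t depends only on inputs at times <= t."""
    def __init__(self, channels, kernel_size=2, dilation=1):
        super().__init__()
        self.left_pad = (kernel_size - 1) * dilation
        self.conv = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)

    def forward(self, x):                    # x: (batch, channels, time)
        x = F.pad(x, (self.left_pad, 0))     # pad only on the left (the past)
        return self.conv(x)

# Stacking layers with dilations 1, 2, 4, 8, ... grows the receptive field
# exponentially with depth, which is the point of dilated convolutions.
stack = nn.Sequential(*[CausalDilatedConv1d(16, dilation=2 ** i) for i in range(6)])
```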

WaveNet uses the gated activation unit as the non-linear layer, as it was found to work significantly better than ReLU for modeling 1-D audio data, with a residual connection applied after the gated activation:

$$\mathbf{z} = \tanh(\mathbf{W}_{f,k} \circledast \mathbf{x}) \odot \sigma(\mathbf{W}_{g,k} \circledast \mathbf{x})$$

where $\mathbf{W}_{f,k}$ and $\mathbf{W}_{g,k}$ are the learnable convolution filter and gate weight matrices of the $k$-th layer, respectively; $\sigma$ is the sigmoid and $\odot$ the element-wise product.
Masked Autoregressive Flow
Masked Autoregressive Flow (MAF; Papamakarios et al., 2017) is a type of normalizing flows, where the transformation layer is built as an autoregressive neural network. MAF is very similar to Inverse Autoregressive Flow (IAF) introduced later. See more discussion on the relationship between MAF and IAF in the next section.
Given two random variables, $z \sim \pi(z)$ and $x \sim p(x)$, where the density $\pi(z)$ is known, MAF aims to learn $p(x)$. MAF generates each $x_i$ conditioned on the past dimensions $\mathbf{x}_{1:i-1}$.
Precisely, the conditional probability is an affine transformation of $z$, where the scale and shift terms are functions of the observed part of $\mathbf{x}$:
- Data generation, producing a new $\mathbf{x}$:
$$x_i \sim p(x_i \mid \mathbf{x}_{1:i-1}) \;\Leftrightarrow\; x_i = z_i \odot \sigma_i(\mathbf{x}_{1:i-1}) + \mu_i(\mathbf{x}_{1:i-1}),\quad z_i \sim \pi(z_i)$$
- Density estimation, given a known $\mathbf{x}$:
$$p(\mathbf{x}) = \prod_{i=1}^{D} p(x_i \mid \mathbf{x}_{1:i-1}),\quad \text{with } z_i = \big(x_i - \mu_i(\mathbf{x}_{1:i-1})\big) / \sigma_i(\mathbf{x}_{1:i-1})$$
The generation procedure is sequential, so it is slow by design, while density estimation only needs one pass through the network using an architecture like MADE. The transformation function is trivial to invert and the Jacobian determinant is easy to compute too.
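A schematic sketch of this asymmetry (illustrative Python only; `mu`, `sigma`, `mu_all` and `sigma_all` stand in for the MADE-style networks producing shift and scale, and details like numerical stabilization are omitted):

```python
import torch

def maf_sample(mu, sigma, D, batch=1):
    """Generation is sequential: x_i needs x_{1:i-1} to be computed first."""
    z = torch.randn(batch, D)
    x = torch.zeros(batch, D)
    for i in range(D):
        x[:, i] = z[:, i] * sigma(x[:, :i], i) + mu(x[:, :i], i)
    return x

def maf_log_prob(mu_all, sigma_all, x):
    """Density estimation is one pass: all mu_i, sigma_i come from a single
    MADE forward pass on x, then z is recovered in parallel."""
    mus, sigmas = mu_all(x), sigma_all(x)      # each of shape (batch, D)
    z = (x - mus) / sigmas
    base_log_prob = torch.distributions.Normal(0.0, 1.0).log_prob(z).sum(dim=-1)
    return base_log_prob - torch.log(sigmas.abs()).sum(dim=-1)
```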
Inverse Autoregressive Flow
Similar to MAF, Inverse Autoregressive Flow (IAF; Kingma et al., 2016) also models the conditional probability of the target variable as an autoregressive model, but with a reversed flow, thus achieving a much more efficient sampling process.
First, let’s reverse the affine transformation in MAF:

$$z_i = \frac{x_i - \mu_i(\mathbf{x}_{1:i-1})}{\sigma_i(\mathbf{x}_{1:i-1})} = -\frac{\mu_i(\mathbf{x}_{1:i-1})}{\sigma_i(\mathbf{x}_{1:i-1})} + x_i \odot \frac{1}{\sigma_i(\mathbf{x}_{1:i-1})}$$

If we relabel the variables and define new shift and scale functions,

$$\tilde{\mathbf{x}} = \mathbf{z},\quad \tilde{\mathbf{z}} = \mathbf{x},\quad \tilde{\mu}_i(\tilde{\mathbf{z}}_{1:i-1}) = -\frac{\mu_i(\mathbf{x}_{1:i-1})}{\sigma_i(\mathbf{x}_{1:i-1})},\quad \tilde{\sigma}_i(\tilde{\mathbf{z}}_{1:i-1}) = \frac{1}{\sigma_i(\mathbf{x}_{1:i-1})}$$

then we would have

$$\tilde{x}_i = \tilde{z}_i \odot \tilde{\sigma}_i(\tilde{\mathbf{z}}_{1:i-1}) + \tilde{\mu}_i(\tilde{\mathbf{z}}_{1:i-1})$$

IAF intends to estimate the probability density function of $\tilde{\mathbf{x}}$ given that $\tilde{\pi}(\tilde{\mathbf{z}})$ is already known. It uses a similar affine transformation to MAF, but the scale and shift terms are autoregressive functions of the variables drawn from the known distribution $\tilde{\mathbf{z}}$, rather than of the target variable $\tilde{\mathbf{x}}$.

Computations of the individual elements $\tilde{x}_i$ do not depend on each other, so sampling is easily parallelizable (only one pass through a MADE-style network). Density estimation for a known $\tilde{\mathbf{x}}$, however, is not efficient, because the values $\tilde{z}_i$ have to be recovered in sequential order, $\tilde{z}_i = (\tilde{x}_i - \tilde{\mu}_i(\tilde{\mathbf{z}}_{1:i-1})) / \tilde{\sigma}_i(\tilde{\mathbf{z}}_{1:i-1})$, which is slow.
| | Base distribution | Target distribution | Model | Data generation | Density estimation |
|---|---|---|---|---|---|
| MAF | $\mathbf{z} \sim \pi(\mathbf{z})$ | $\mathbf{x} \sim p(\mathbf{x})$ | $x_i = z_i \odot \sigma_i(\mathbf{x}_{1:i-1}) + \mu_i(\mathbf{x}_{1:i-1})$ | Sequential; slow | One pass; fast |
| IAF | $\tilde{\mathbf{z}} \sim \tilde{\pi}(\tilde{\mathbf{z}})$ | $\tilde{\mathbf{x}} \sim \tilde{p}(\tilde{\mathbf{x}})$ | $\tilde{x}_i = \tilde{z}_i \odot \tilde{\sigma}_i(\tilde{\mathbf{z}}_{1:i-1}) + \tilde{\mu}_i(\tilde{\mathbf{z}}_{1:i-1})$ | One pass; fast | Sequential; slow |
VAE + Flows
In Variational Autoencoder, if we want to model the posterior $p(\mathbf{z}\mid\mathbf{x})$ as a more complicated distribution rather than a simple Gaussian, we can apply a normalizing flow on top of the encoder output: the encoder predicts the parameters of a simple base distribution together with those of a chain of invertible transformations, and the transformed sample serves as the latent code. This is the idea behind works such as variational inference with normalizing flows (Rezende & Mohamed, 2015), IAF (Kingma et al., 2016), and f-VAEs (Su & Wu, 2018).
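Concretely, following the general recipe of Rezende & Mohamed (2015): if $\mathbf{z}_0 \sim q_0(\mathbf{z}_0 \mid \mathbf{x})$ is the simple encoder output and $\mathbf{z}_K = f_K \circ \dots \circ f_1(\mathbf{z}_0)$, then the log-density of the transformed posterior that enters the ELBO is:

$$\log q_K(\mathbf{z}_K \mid \mathbf{x}) = \log q_0(\mathbf{z}_0 \mid \mathbf{x}) - \sum_{i=1}^K \log\left|\det\frac{\partial f_i}{\partial \mathbf{z}_{i-1}}\right|$$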
If you notice mistakes and errors in this post, don’t hesitate to contact me at [lilian dot wengweng at gmail dot com] and I would be very happy to correct them right away!
See you in the next post :D
Cited as:
@article{weng2018flow,
title = "Flow-based Deep Generative Models",
author = "Weng, Lilian",
journal = "aptsunny.github.io",
year = "2018",
url = "https://aptsunny.github.io/posts/2018-10-13-flow-models/"
}
Reference
[1] Danilo Jimenez Rezende, and Shakir Mohamed. “Variational inference with normalizing flows." ICML 2015.
[2] Normalizing Flows Tutorial, Part 1: Distributions and Determinants by Eric Jang.
[3] Normalizing Flows Tutorial, Part 2: Modern Normalizing Flows by Eric Jang.
[4] Normalizing Flows by Adam Kosiorek.
[5] Laurent Dinh, Jascha Sohl-Dickstein, and Samy Bengio. “Density estimation using Real NVP." ICLR 2017.
[6] Laurent Dinh, David Krueger, and Yoshua Bengio. “NICE: Non-linear independent components estimation." ICLR 2015 Workshop track.
[7] Diederik P. Kingma, and Prafulla Dhariwal. “Glow: Generative flow with invertible 1x1 convolutions." arXiv:1807.03039 (2018).
[8] Mathieu Germain, Karol Gregor, Iain Murray, and Hugo Larochelle. “MADE: Masked autoencoder for distribution estimation." ICML 2015.
[9] Aaron van den Oord, Nal Kalchbrenner, and Koray Kavukcuoglu. “Pixel recurrent neural networks." ICML 2016.
[10] Diederik P. Kingma, et al. “Improved variational inference with inverse autoregressive flow." NIPS. 2016.
[11] George Papamakarios, Iain Murray, and Theo Pavlakou. “Masked autoregressive flow for density estimation." NIPS 2017.
[12] Jianlin Su, and Guang Wu. “f-VAEs: Improve VAEs with Conditional Flows." arXiv:1809.05861 (2018).
[13] Aaron van den Oord, et al. “WaveNet: A generative model for raw audio." SSW 2016.