
[Generative AI] Why Maximize ELBO instead of Log-Likelihood? [VAE & Diffusion Models]


Introduction

This article is the 10th-day entry for the Generative AI Advent Calendar 2024.

While studying image generation AI, you often hear that to model a probability distribution, the model is trained to maximize the log-likelihood.
I have also written articles providing simple theoretical explanations regarding "Diffusion Models," one of the image generation AI architectures.

https://zenn.dev/asap/articles/4092ab60570b05
https://zenn.dev/asap/articles/8aaa896a02f168

I called them theoretical explanations, but they avoid heavy use of mathematical formulas, so I believe anyone can read and understand them easily.
(The second article has a bit more math, so it is mainly for those interested in the mathematical aspects.)

As explained in the articles above, image generation AI basically learns the probability distribution of natural images. To do this, it attempts to learn the distribution by maximizing the log-likelihood.

However, in most textbooks and technical articles (regarding Diffusion Models or VAEs), you will find that the ELBO is maximized instead of the log-likelihood.
This article explores "why we maximize the ELBO instead of the log-likelihood."

The conclusion is that "in the framework of VAEs and Diffusion Models, the log-likelihood cannot be calculated directly."

References

Deep Learning from Scratch ❺ — Generative Models

This is the fifth installment of the outstanding "Deep Learning from Scratch" series.
With the eventual goal of understanding Diffusion Models, the book lets you implement while understanding from scratch, going back through VAEs and Gaussian Mixture Models all the way to the Normal Distribution. It is a truly wonderful book.

Even on its own, the book is very clear, but I felt the section regarding "why we maximize ELBO instead of log-likelihood" (or rather, why we must use ELBO) had some gaps between the lines. I hope this article can bridge those gaps.
(Of course, compared to other books, the gaps are already very narrow, but for someone like me who is weak at math, it took time to understand, so I want to fill those gaps. This is also for my future self.)

Introduction to Calculus + Linear Algebra for Uncompromising Data Analysis

The latter half of this book contains brief descriptions of VAEs and ELBO, and I referred to its phrasing while writing this article.
It was very easy for beginners to understand because it always explicitly states which parts of the probability distribution are computable and which are not!

Especially if you want to understand the theory properly rather than just using generative AI, you will need prior knowledge of "Linear Algebra" and "Calculus" to understand papers and theoretical texts.
This is an excellent book for beginners to acquire the necessary foundational knowledge first.

Mathematics of Diffusion Models: Data Generation Technology

This is the definitive book for the theory of Diffusion Models.
Since it uses more formulas, it might be easier to understand after finishing Deep Learning from Scratch ❺ — Generative Models or Introduction to Calculus + Linear Algebra for Uncompromising Data Analysis (or having equivalent knowledge).
The great thing about this book is that while it uses many formulas for a rigorous understanding, the logical gaps between consecutive formulas are small, making it very easy to follow.
Also, difficult parts are supplemented with diagrams for visual understanding, so even those who find university textbooks challenging should be able to follow along if they have a STEM background!

(Book links are Amazon affiliate links)

About Log-Likelihood

I explained how (image) generation models create diverse images in this article, but let's take a quick look back.

Prepare a dataset D containing a large number of natural images.

If each natural image is x_k (k is a natural number), it can be expressed as:

D=\{x_1,x_2,\dots\}

Assume that all these images x_k are sampled from a God-given probability distribution of natural images p(x) and brought into the real world.
From this perspective, if we can reproduce that probability distribution by some method, we can sample from it to generate images similar to real natural images.

If we assume this reproduction probability distribution is controlled by some parameter \theta, we can write it as p_\theta.

How do we determine this parameter? Optimization is possible by updating the parameter \theta to maximize the conditional probability of the natural image x given \theta.
(At the risk of being repetitive: please refer to the previous article for this explanation.)

This conditional probability can be written as p_\theta(x) or p(x|\theta), but for clarity in the subsequent formula derivations, I will use p_\theta(x).
I thought this would make it easier to distinguish \theta from other parameters.

This conditional probability represents "the probability of observing a specific natural image under a certain probability distribution (model parameters)," and when viewed as a function of the parameters it is specifically called the "likelihood."

In (image) generation models, the parameter \theta is optimized by maximizing this likelihood—in other words, through "Maximum Likelihood Estimation" (MLE).
Specifically:

L(\theta; x_1, x_2, \dots, x_n) = \prod_{i=1}^{n} p_{\theta}(x_i)

Furthermore, when performing Maximum Likelihood Estimation, the likelihood is converted to log-likelihood.
Specifically:

\ell(\theta; x_1, x_2, \dots, x_n) = \log L(\theta; x_1, x_2, \dots, x_n) = \sum_{i=1}^{n} \log p_{\theta}(x_i)

The product of likelihoods easily causes underflow as the number of data points increases.
Therefore, by taking the logarithm, the objective function can be transformed into a sum of data points, enabling stable learning.
Additionally, since the logarithm is a monotonically increasing function, the parameters that maximize the log-likelihood are identical to those that maximize the likelihood.
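The underflow problem and the log's monotonicity are easy to see concretely. A minimal sketch (the per-sample likelihood values are made up for illustration):

```python
import math

# Hypothetical per-sample likelihoods: tiny probabilities, as is typical
# for high-dimensional image data (these numbers are made up).
probs = [1e-30] * 20

# Naive product of likelihoods underflows to 0.0 in float64,
# since 1e-30 ** 20 = 1e-600 is far below the smallest representable double.
product = 1.0
for p in probs:
    product *= p
print(product)  # 0.0 — all information about theta is lost

# Summing log-likelihoods instead stays perfectly representable.
log_likelihood = sum(math.log(p) for p in probs)
print(log_likelihood)  # ≈ -1381.55
```

Because log is monotonically increasing, ranking parameter settings by the second quantity gives the same maximizer as ranking by the (unrepresentable) first.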

Why the log-likelihood cannot be calculated

Now, the review is complete.
As mentioned above, in image generation models and others, parameters \theta are chosen to maximize the log-likelihood \log p_{\theta}(x).

However, to do that, it is first necessary to calculate the value of \log p_{\theta}(x).

What is likelihood (probability distribution) in VAEs and diffusion models?

In VAEs and diffusion models, the final output is the generated image itself.

For example, in a framework like "PixelCNN," the final output is a 256-way classification for each pixel: the model outputs a probability distribution over the discrete pixel values 0 to 255. In such a framework, since the model output itself is a probability distribution, it is easy to see how to calculate the likelihood.
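As a sketch of this discrete-output setup (the "model" below is a made-up stand-in, not the actual PixelCNN architecture), the log-likelihood of an image is simply the sum of the log-probabilities the model assigns to the observed pixel values:

```python
import math

def image_log_likelihood(pixel_probs, pixels):
    """pixel_probs: one 256-entry probability distribution per pixel.
    pixels: the observed pixel values (ints 0..255).
    Returns the directly computable log-likelihood of the image."""
    return sum(math.log(probs[v]) for probs, v in zip(pixel_probs, pixels))

# Tiny 3-pixel "image" scored under a uniform model, for illustration.
uniform = [1.0 / 256] * 256
ll = image_log_likelihood([uniform] * 3, [0, 128, 255])
print(ll)  # 3 * log(1/256) ≈ -16.64
```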

So, what kind of probability distribution is assumed when the final output is the generated image itself? We assume a normal distribution where the mean is the network output \hat{x} and the variance is I.
As explained in the formula derivations later, by setting it up this way, the final objective function boils down to minimizing the squared error between the ground truth data x and the generated data \hat{x}.

In other words, the parameters of the probability distribution are determined solely by the mean \hat{x}.
And this mean \hat{x} is determined by the output of the network.

This allows us to discuss the maximization of log-likelihood (the probability distribution) even within the framework of VAEs and diffusion models.

Now, let's dive into the actual discussion.

What exactly is log-likelihood?

Let's take another proper look at the log-likelihood formula.
Log-likelihood is expressed as follows:

\ell(\theta; x_1, x_2, \dots, x_n) = \log L(\theta; x_1, x_2, \dots, x_n) = \sum_{i=1}^{n} \log p_{\theta}(x_i)

For simplicity, let's consider \log p_{\theta}(x_i) focusing on a single sample i.

The meaning of this expression is the log-likelihood of the "ground truth data x_i" conditioned only on the network parameters \theta.
In other words, this p_{\theta}(x_i) must be a probability distribution that is not conditioned on anything other than the network parameters \theta; every other variable must already be marginalized out.

Yes, not even the latent representation may appear as a condition.

In both VAEs and diffusion models, the decoder reconstructs an image using some latent representation z_i as input.
Below, for simplicity, we will focus on VAEs, but the general discussion is roughly the same for diffusion models.

The true nature of the probability distribution created by VAE

As mentioned above, in a VAE, the decoder reconstructs an image using some latent representation z_i as input.
Therefore, the probability distribution finally created by the network output (+ the subsequent normal distribution) is:

p_{\theta}(x_i|z_i)

This differs from the pure likelihood p_{\theta}(x_i).

Expressing it mathematically like this helps in understanding that they are completely different.

Now, let's try transforming the formula to calculate the pure log-likelihood.
Also, from here on, let the set of decoder parameters be \theta.
In other words, the probability distribution represented by p_{\theta} will be the one created by the decoder.

Log-likelihood formula transformation

The likelihood p_{\theta}(x_i) can be transformed using marginalization over the latent variable and Bayes' theorem.

First, let's consider marginalization. The likelihood can then be written as follows:

p_{\theta}(x_i) = \int p_{\theta}(x_i \mid z_i) p(z_i) \, dz_i

That is, we would need to calculate the conditional probability p_{\theta}(x \mid z) for all possible latent representations and integrate over them. For a continuous, high-dimensional z, this is practically impossible.

Also, it's important to remember that the formula above is presented for a single sample i.
The actual objective function should involve the log-likelihood for all data, as follows:

\ell(\theta; x_1, x_2, \dots, x_n) = \sum_{i=1}^{n} \log \int p_{\theta}(x_i \mid z_i) p(z_i) \, dz_i

As you can see from the expression above, it has a "log of a sum (integral)" form. While a "sum of logs" would still be manageable, the "log-sum" form is difficult to handle analytically.
Therefore, even if the latent variable z were discrete and took only a few values, the objective could not be optimized without techniques such as alternating optimization.
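Although the integral is hopeless for images, in one dimension we can see what "integrating over all possible latent representations" means via naive Monte Carlo. A sketch under a toy model chosen purely for illustration (not a VAE; in this toy case the marginal happens to have a closed form we can check against):

```python
import math
import random

random.seed(0)

def normal_pdf(x, mean, var):
    """Density of N(mean, var) at x."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

# Toy 1-D model (an assumption for illustration): prior z ~ N(0, 1),
# likelihood p(x|z) = N(x; z, 1).  Then p(x) = ∫ p(x|z) p(z) dz = N(x; 0, 2).
x = 1.0
exact = normal_pdf(x, 0.0, 2.0)

# Naive Monte Carlo estimate of the integral: sample z from the prior
# and average the likelihood.  Feasible in 1-D, hopeless for image-sized z.
n = 200_000
estimate = sum(normal_pdf(x, random.gauss(0, 1), 1.0) for _ in range(n)) / n

print(exact, estimate)  # the two values agree to a few decimal places
```

In high dimensions, almost all prior samples contribute essentially zero likelihood, which is why this direct approach breaks down.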

Now, back to the formula transformation.
Next, let's transform it using Bayes' theorem. The following transformation holds:

p_{\theta}(x_i) = \dfrac{p(z_i)p_{\theta}(x_i|z_i)}{p_{\theta}(z_i|x_i)}

Now, let's check if each part of the following expression is computable.

p_{\theta}(x_i) = \dfrac{p(z_i)p_{\theta}(x_i|z_i)}{p_{\theta}(z_i|x_i)}

First, the p(z_i) in the numerator is the prior distribution of the latent representation z_i. Just as in Bayesian statistics, we can define and introduce this prior distribution ourselves, so it's not a problem.
In VAEs, a standard normal distribution is often used.

Next, p_{\theta}(x_i|z_i) in the numerator is the likelihood of the ground truth data x_i given the latent variable z_i, under the environment of decoder parameters \theta.
Therefore, it can be modeled as a normal distribution with the decoder output \hat{x} as the mean and I as the variance.

Finally, the denominator p_{\theta}(z_i|x_i) is the "posterior probability" of the latent variable z_i given the ground truth data x_i, under the environment of decoder parameters \theta.
Since this cannot be calculated from the decoder's perspective, further transformation is required.

Since it's a posterior probability, we transform it using Bayes' theorem.

p_{\theta}(z_i|x_i) = \dfrac{p(z_i)p_{\theta}(x_i|z_i)}{p_{\theta}(x_i)} = \dfrac{p(z_i)p_{\theta}(x_i|z_i)}{\int p_{\theta}(x_i \mid z_i) p(z_i) \, dz_i}

Therefore, an uncomputable form has appeared in the denominator again.

I've written at length, but in conclusion, the likelihood p_{\theta}(x_i) cannot be calculated because uncomputable parts appear no matter how the formula is transformed.

Future Direction

So, what do we do?
We know the weapon called ELBO, but let's assume for now that we don't know it.

The core idea is to consider the following equation and maximize m instead of the log-likelihood \log p_{\theta}(x_i).

\log p_{\theta}(x_i) \geq m

In the expression above, since the log-likelihood \log p_{\theta}(x_i) is always greater than or equal to some value m, the idea is to indirectly maximize the log-likelihood \log p_{\theta}(x_i) by maximizing m.

In doing so, what helps is KL divergence or Jensen's inequality.

Using Jensen's inequality allows for a more concise transformation, but since using KL divergence makes the intent of the expression easier to understand, we will consider the transformation using KL divergence.

KL divergence can be expressed by providing two probability distributions (f(x), g(x)) as follows:

D_{\text{KL}}(f(x) \| g(x)) = \int f(x) \log \frac{f(x)}{g(x)} \, dx

It is known that this KL divergence is always 0 or greater.
Therefore, if we can rewrite the log-likelihood in the form:

\log p_{\theta}(x_i) = m + D_{\text{KL}}

then \log p_{\theta}(x_i) \geq m follows immediately.
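The non-negativity of the KL divergence is easy to check numerically. A minimal sketch with two made-up discrete distributions:

```python
import math

def kl(f, g):
    """KL divergence between two discrete distributions given as probability lists."""
    return sum(fi * math.log(fi / gi) for fi, gi in zip(f, g) if fi > 0)

# Two arbitrary distributions over 4 outcomes (made-up numbers).
f = [0.1, 0.2, 0.3, 0.4]
g = [0.25, 0.25, 0.25, 0.25]

print(kl(f, g))  # > 0: the distributions differ
print(kl(f, f))  # 0.0: identical distributions give zero divergence
```

Note also that KL divergence is asymmetric: kl(f, g) and kl(g, f) generally differ, which is why the direction of the divergence matters in the derivations below.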


Therefore, for probabilities that cannot be calculated, we consider pushing them into this KL divergence.

Furthermore, KL divergence is a formula that calculates the distance between two probability distributions, and it becomes 0 if the two distributions are identical.
Therefore, when the two distributions match, m and the log-likelihood \log p_{\theta}(x_i) will be identical, and maximizing m will perfectly correspond to maximizing the log-likelihood \log p_{\theta}(x_i).

Thus, when pushing uncomputable probability distributions into KL divergence, it's clear that it's better to prepare some probability distribution that can approximate the uncomputable one with high accuracy and push it in.

Now, based on the above ideas, let's proceed on the journey to transform the log-likelihood \log p_{\theta}(x_i) and find m which can be expressed only with computable distributions!

Actually Performing the Formula Transformation

Organizing Our Tools

Now, this is where the real work begins. We will actually perform the formula transformation of the log-likelihood \log p_{\theta}(x_i).

First, let's transform the expression using Bayes' theorem.

\log p_{\theta}(x_i) = \log \dfrac{p(z_i)p_{\theta}(x_i|z_i)}{p_{\theta}(z_i|x_i)} = \log p(z_i) + \log p_{\theta}(x_i|z_i) - \log p_{\theta}(z_i|x_i)

As mentioned earlier, the uncomputable distribution was the denominator p_{\theta}(z_i|x_i). Since \theta represents the parameters of the decoder, we cannot calculate the posterior distribution of the latent variable z_i.

Let's consider pushing this uncomputable distribution into the KL divergence D_{\text{KL}}. KL divergence D_{\text{KL}} is defined as follows:

D_{\text{KL}}(f(x) \| g(x)) = \int f(x) \log \frac{f(x)}{g(x)} \, dx

KL divergence D_{\text{KL}} is always non-negative and equals 0 when the two distributions are identical. We will consider a distribution that approaches p_{\theta}(z_i|x_i) so that this KL divergence gets as close to 0 as possible.

The distribution we want to "push in" is ultimately the posterior distribution of the latent variable z_i. To that end, let's consider some distribution q(z_i|x_i) that can represent the distribution of the latent variable z_i. (For now, we won't worry about whether q(z_i|x_i) is computable or not.)

Since q(z_i|x_i) is a probability distribution, the following two equations hold (the latter is trivial):

\int q(z_i|x_i) \, dz_i = 1
\log q(z_i|x_i) - \log q(z_i|x_i) = 0

Now, do you see that by using these two equations and the log-likelihood formula, we can construct the following?

D_{\text{KL}}(q(z_i|x_i) \| p_{\theta}(z_i|x_i)) = \int q(z_i|x_i) \log \frac{q(z_i|x_i)}{p_{\theta}(z_i|x_i)} \, dz_i

Let's proceed with the actual transformation.

Formula Transformation Using Our Tools

Let's consider the log-likelihood \log p_{\theta}(x_i).

First, we introduce Tool 1:

\int q(z_i|x_i) \, dz_i = 1

Since the value of this expression is 1, we can multiply the log-likelihood \log p_{\theta}(x_i) by it. Furthermore, because \log p_{\theta}(x_i) does not depend on the latent variable z_i, it can be moved inside the integral:

\log p_{\theta}(x_i) = \int q(z_i|x_i) \, dz_i \log p_{\theta}(x_i) = \int q(z_i|x_i)\log p_{\theta}(x_i) \, dz_i

Next, according to Bayes' theorem and the laws of logarithms:

\int q(z_i|x_i)\log p_{\theta}(x_i) \, dz_i = \int q(z_i|x_i)\log \dfrac{p(z_i)p_{\theta}(x_i|z_i)}{p_{\theta}(z_i|x_i)} \, dz_i
= \int q(z_i|x_i)\{\log p(z_i) + \log p_{\theta}(x_i|z_i) - \log p_{\theta}(z_i|x_i)\} \, dz_i

Next, we introduce Tool 2:

\log q(z_i|x_i) - \log q(z_i|x_i) = 0

Using this, the formula can be transformed as follows:

\int q(z_i|x_i)\{\log p(z_i) + \log p_{\theta}(x_i|z_i) - \log p_{\theta}(z_i|x_i)\} \, dz_i

Since Tool 2 equals 0, we can add it inside the brackets:

= \int q(z_i|x_i)\{\log p_{\theta}(x_i|z_i) + \log p(z_i) - \log p_{\theta}(z_i|x_i) + \{\log q(z_i|x_i) - \log q(z_i|x_i) \}\} \, dz_i

Changing the calculation order within the brackets:

= \int q(z_i|x_i)\{\log p_{\theta}(x_i|z_i) + \{\log p(z_i) - \log q(z_i|x_i)\} - \{\log p_{\theta}(z_i|x_i) - \log q(z_i|x_i)\} \} \, dz_i

Converting differences into quotients using the laws of logarithms:

= \int q(z_i|x_i)\log p_{\theta}(x_i|z_i) \, dz_i + \int q(z_i|x_i)\log \dfrac{p(z_i)}{q(z_i|x_i)} \, dz_i - \int q(z_i|x_i) \log \dfrac{p_{\theta}(z_i|x_i)}{q(z_i|x_i)} \, dz_i

Taking the reciprocal of the arguments and flipping signs to form the KL divergence structure:

= \int q(z_i|x_i)\log p_{\theta}(x_i|z_i) \, dz_i - \int q(z_i|x_i)\log \dfrac{q(z_i|x_i)}{p(z_i)} \, dz_i + \int q(z_i|x_i) \log \dfrac{q(z_i|x_i)}{p_{\theta}(z_i|x_i)} \, dz_i

= \mathbb{E}_{q(z_i|x_i)}[\log p_{\theta}(x_i|z_i)] - \mathrm{KL}\left(q(z_i|x_i) \parallel p(z_i)\right) + \mathrm{KL}\left(q(z_i|x_i) \parallel p_{\theta}(z_i|x_i)\right)

The Identity of the Probability Distribution q

We have been transforming the formula, but at this point, the uncomputable distribution is p_{\theta}(z_i|x_i), and the distribution whose computability is unknown is q(z_i|x_i).

Now, let's consider the distribution q.

Organizing the symbols, x_i is the i-th data point in the ground truth image data within the dataset D, and z_i is the corresponding latent representation. Since q is a probability distribution we independently introduced for the formula transformation, it can be designed freely as long as it remains a distribution of z_i conditioned on x_i.

On the other hand, the index i runs over every data point in the dataset. Therefore, we would need to prepare a separate distribution for each data point x_i: q(z_1|x_1), q(z_2|x_2), \cdots , q(z_N|x_N).

While preparing hundreds of millions of distributions for a dataset with hundreds of millions of images is not realistic, we have a powerful tool that can approximate mappings of large amounts of inputs and outputs with a single model.
Yes, neural networks.

So, by considering a neural network with ground truth image data x_i as input, the corresponding latent variable z_i as output, and parameters \psi—in other words, an Encoder—we can rewrite the probability distribution q as follows:

q(z_i|x_i) = q_{\psi}(z_i|x_i)

From the results above, we find that the log-likelihood \log p_{\theta}(x_i) can be transformed as follows:

\log p_{\theta}(x_i) = \mathbb{E}_{q_{\psi}(z_i|x_i)}[\log p_{\theta}(x_i|z_i)] - \mathrm{KL}\left(q_{\psi}(z_i|x_i) \parallel p(z_i)\right) + \mathrm{KL}\left(q_{\psi}(z_i|x_i) \parallel p_{\theta}(z_i|x_i)\right)
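In a toy 1-D Gaussian model where every distribution is tractable, this three-term decomposition can be verified exactly. A minimal sketch (the model and all numbers are assumptions chosen for illustration):

```python
import math

# Toy conjugate Gaussian model, where everything has a closed form:
#   prior      p(z)   = N(0, 1)
#   likelihood p(x|z) = N(x; z, 1)
#   marginal   p(x)   = N(x; 0, 2)
#   posterior  p(z|x) = N(x/2, 1/2)
# Variational distribution q(z|x) = N(m, s2) with arbitrary m, s2.
x, m, s2 = 1.2, 0.3, 0.64

log_px = -0.5 * math.log(4 * math.pi) - x**2 / 4

# E_q[log p(x|z)]: expectation of the Gaussian log-density under q.
term1 = -0.5 * math.log(2 * math.pi) - ((x - m) ** 2 + s2) / 2

# KL(q || p(z)) between a Gaussian and the standard normal.
term2 = 0.5 * (s2 + m**2 - 1 - math.log(s2))

# KL(q || p(z|x)) between two univariate Gaussians.
mu_post, var_post = x / 2, 0.5
term3 = (0.5 * math.log(var_post / s2)
         + (s2 + (m - mu_post) ** 2) / (2 * var_post) - 0.5)

# Decomposition: log p(x) = E_q[log p(x|z)] - KL(q||p(z)) + KL(q||p(z|x))
print(log_px, term1 - term2 + term3)  # both sides agree
```

Changing m and s2 changes the individual terms, but the identity holds for any choice of q, exactly as the derivation above says.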

Interesting Relationship Between Log-Likelihood and ELBO

Derivation of ELBO

Now, let's first derive the ELBO, which was our initial goal.

As a result of transforming the log-likelihood formula, there is one uncomputable term: \mathrm{KL}\left(q_{\psi}(z_i|x_i) \parallel p_{\theta}(z_i|x_i)\right).

As mentioned repeatedly, the posterior probability p_{\theta}(z_i|x_i) contained in this KL divergence is uncomputable.

Therefore, by utilizing the fact that KL divergence is non-negative, the following inequality transformation is possible:

\log p_{\theta}(x_i) \geq \mathbb{E}_{q_{\psi}(z_i|x_i)}[\log p_{\theta}(x_i|z_i)] - \mathrm{KL}\left(q_{\psi}(z_i|x_i) \parallel p(z_i)\right)

The right-hand side of this inequality is called the ELBO (Evidence Lower Bound); as the name suggests, it is a lower bound of the log-likelihood (the "evidence").

(Supplement) Derivation of ELBO using Jensen's inequality

I will omit the detailed explanation, but Jensen's inequality relates the weighted average of values transformed by a convex or concave function (like log) to the transformation of the weighted average of the values.
(Assuming the weights sum to 1: for a convex function, the weighted average of the transformed values is the larger side, while for a concave function, the transformation of the weighted average is the larger side.)

Incidentally, Jensen's inequality for the log function can be expressed as follows:

\log \int q(z_i)f(z_i) \, dz_i \geq \int q(z_i) \log f(z_i)\, dz_i

Using this, the log-likelihood can be transformed as follows:

\log p_{\theta}(x_i) = \log \int p_{\theta}(x_i \mid z_i) p(z_i) \, dz_i
= \log \int q_{\psi}(z_i|x_i) \dfrac{p_{\theta}(x_i \mid z_i)p(z_i)}{q_{\psi}(z_i|x_i)} \, dz_i

(Using Jensen's inequality here)

\geq \int q_{\psi}(z_i|x_i) \log \dfrac{p_{\theta}(x_i \mid z_i)p(z_i)}{q_{\psi}(z_i|x_i)} \, dz_i
= \mathbb{E}_{q_{\psi}(z_i|x_i)}[\log p_{\theta}(x_i|z_i)] - \mathrm{KL}\left(q_{\psi}(z_i|x_i) \parallel p(z_i)\right) = \mathrm{ELBO}

This makes deriving the ELBO simple.

However, deriving it using KL divergence makes the intent of the formula easier to understand, so I recommend that beginners understand that derivation instead.
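Incidentally, Jensen's inequality for the concave log is easy to confirm numerically. A minimal sketch with made-up weights and values:

```python
import math

# Weights q summing to 1, and positive values f (both made up).
q = [0.2, 0.5, 0.3]
f = [1.0, 4.0, 9.0]

lhs = math.log(sum(qi * fi for qi, fi in zip(q, f)))   # log of the weighted mean
rhs = sum(qi * math.log(fi) for qi, fi in zip(q, f))   # weighted mean of the log

print(lhs, rhs)
assert lhs >= rhs  # Jensen's inequality for the concave log function
```

This is exactly the step used above with q = q_{\psi}(z_i|x_i) as the weights and f = p_{\theta}(x_i|z_i)p(z_i)/q_{\psi}(z_i|x_i) as the values.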

Conditions for ELBO to approach log-likelihood

Even if we maximize the ELBO instead of the log-likelihood, optimization should be more efficient if the ELBO and log-likelihood are as close as possible. Therefore, it is meaningful to consider under what conditions the ELBO and log-likelihood become close.

As you can see from the discussion so far, the ELBO approaches the log-likelihood when the KL divergence \mathrm{KL}\left(q_{\psi}(z_i|x_i) \parallel p_{\theta}(z_i|x_i)\right) approaches 0.

However, since all we can do is maximize the ELBO, this KL divergence is not included in the optimization target.
Therefore, we need to see how the KL divergence changes as we maximize the ELBO.

Recalling the transformation of the log-likelihood:

\log p_{\theta}(x_i) = \mathrm{ELBO}_{\theta, \psi} + \mathrm{KL}\left(q_{\psi}(z_i|x_i) \parallel p_{\theta}(z_i|x_i)\right)

Looking at the left side of the equation, we can see that the parameters affecting the log-likelihood \log p_{\theta}(x_i) itself are only the decoder parameters \theta.

In other words, changing the encoder parameters \psi does not change the value of the log-likelihood \log p_{\theta}(x_i) itself.
What changes is the value of \mathrm{ELBO}_{\theta, \psi}.

This means that by optimizing the encoder parameters \psi, the value of \mathrm{ELBO}_{\theta, \psi} can be increased under conditions where the value of the log-likelihood is constant.

Returning to the log-likelihood transformation, if the left side is constant and the first term on the right becomes larger, the second term must necessarily become "smaller."

Since the second term is a KL divergence and is non-negative, it approaches 0.

Therefore, by proceeding with VAE training to maximize the ELBO, the uncomputable distribution p_{\theta}(z_i|x_i) is approximated by q_{\psi}(z_i|x_i).

As mentioned before, the encoder outputs mean and variance parameters to construct the distribution q_{\psi}(z_i|x_i) as a normal distribution.
On the other hand, the uncomputable distribution p_{\theta}(z_i|x_i) is expected to be a very complex distribution, unlike a normal distribution.

Therefore, it is assumed that this KL divergence will not become 0, but the complex p_{\theta}(z_i|x_i) is still approximated by the simple normal distribution q_{\psi}(z_i|x_i), bringing the ELBO and log-likelihood as close together as possible.

This technique of approximating an uncomputable distribution with a simple distribution such as a normal distribution is called "Variational Approximation."
I heard that the 'V' in VAE comes from this 'Variational.'

Analysis of the ELBO

Next, let's look at each term of the ELBO in detail. For clarity, the ELBO is restated below:

\mathrm{ELBO} = \mathbb{E}_{q_{\psi}(z_i|x_i)}[\log p_{\theta}(x_i|z_i)] - \mathrm{KL}\left(q_{\psi}(z_i|x_i) \parallel p(z_i)\right)

Considering the First Term

ELBO's first term is as follows:

\mathbb{E}_{q_{\psi}(z_i|x_i)}[\log p_{\theta}(x_i|z_i)]

This is the expectation of the conditional log-likelihood of the ground truth image data x_i given the latent representation z_i in the decoder, based on the encoder's conditional distribution.

In deep learning contexts, an expectation can be viewed as the average of results obtained from a large amount of data. If we approximate the expectation with a sample size of 1 and assume q_{\psi}(z_i|x_i) is modeled by the encoder, the formula can be transformed as follows:

\mathbb{E}_{q_{\psi}(z_i|x_i)}[\log p_{\theta}(x_i|z_i)] \approx \log \mathcal{N}(x_i; \hat{x_i}, I)

Where:

  • x_i is the i-th ground truth image data from the dataset D.
  • z_i is the latent representation sampled from a normal distribution formed by the mean and variance parameters obtained from the encoder with parameters \psi and input x_i.
  • \hat{x_i} is the output image data from the decoder with parameters \theta and input z_i.
  • \mathcal{N}(x_i; \hat{x_i}, I) is the probability of the ground truth image data x_i in a normal distribution with mean \hat{x_i} and variance I.

Therefore, if you want to maximize the first term of the ELBO, you just need to maximize \log \mathcal{N}(x_i; \hat{x_i}, I).

First, let's look at the probability density function of a multivariate normal distribution. Considering a general form where the variance-covariance matrix is \Sigma, it is as follows:

\mathcal{N}(x_i; \hat{x_i}, \Sigma) = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp \left( -\frac{1}{2} (x_i - \hat{x_i})^T \Sigma^{-1} (x_i - \hat{x_i}) \right)

Here, d is the dimensionality of x_i (in terms of images, "pixels x number of channels"). Taking the logarithm gives us:

\log \mathcal{N}(x_i; \hat{x_i}, \Sigma) = -\frac{d}{2} \log (2\pi) - \frac{1}{2} \log |\Sigma| - \frac{1}{2} (x_i - \hat{x_i})^T \Sigma^{-1} (x_i - \hat{x_i})

Now, let's consider the specific case where the covariance matrix is the identity matrix, which is our current problem setting. This results in the following:

\log \mathcal{N}(x_i; \hat{x_i}, I) = -\frac{d}{2} \log (2\pi)- \frac{1}{2} (x_i - \hat{x_i})^T (x_i - \hat{x_i})
= -\frac{d}{2} \log (2\pi) - \frac{1}{2} \| x_i - \hat{x_i} \|^2

In solving the optimization problem, the first term is a constant and can be ignored. Therefore:
Maximizing the first term of the ELBO boils down to the problem of minimizing the squared error between the ground truth image data and the decoder output.
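This reduction can be checked directly: the log-density under an identity-covariance Gaussian differs from -\frac{1}{2} times the squared error only by a constant. A minimal sketch with made-up vectors:

```python
import math

def log_normal_identity_cov(x, x_hat):
    """log N(x; x_hat, I) for a d-dimensional vector x, identity covariance."""
    d = len(x)
    sq = sum((a - b) ** 2 for a, b in zip(x, x_hat))
    return -0.5 * d * math.log(2 * math.pi) - 0.5 * sq

# Made-up "image" vectors for illustration (d = 3).
x     = [0.1, 0.5, 0.9]
x_hat = [0.2, 0.4, 1.0]

const = -0.5 * 3 * math.log(2 * math.pi)
sq_err = sum((a - b) ** 2 for a, b in zip(x, x_hat))

# The two quantities are equal, so maximizing the log-likelihood
# is exactly minimizing the squared error (the constant drops out).
print(log_normal_identity_cov(x, x_hat), const - 0.5 * sq_err)
```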

Considering the Second Term

The second term of the ELBO is as follows:

- \mathrm{KL}\left(q_{\psi}(z_i|x_i) \parallel p(z_i)\right)

In maximizing the ELBO, since the second term enters with a negative sign, we want to bring the non-negative KL divergence as close to 0 as possible.

The distributions to be brought closer together are the encoder's probability distribution (q_{\psi}(z_i|x_i)) and the prior distribution of the latent representation z_i, p(z_i).

As in Bayesian statistics, the prior distribution is one we can define as we see fit. However, setting an arbitrary distribution will degrade accuracy, so we must specify a reasonably valid distribution.

Additionally, a VAE is required to function as a generative AI. This means that if we sample a latent variable z appropriately, a natural image must be reconstructed. Therefore, unlike a standard Autoencoder, the latent representations must be distributed densely.

Thus, VAEs set a standard normal distribution with mean 0 and variance I as the prior distribution for the latent representation z_i.

Why the Standard Normal Distribution is Used as the Prior

There are several reasons why the standard normal distribution with mean 0 and variance I is set as the prior distribution of the latent representation z_i in a VAE.

First, the normal distribution is the distribution that appears as a result of solving the entropy maximization problem using Lagrange multipliers; under the condition of mean 0 and variance I, it is the distribution with the maximum entropy. Therefore, it is an excellent distribution as a prior because it does not add unnecessary information or bias beyond the mean 0 and variance I condition.

Second, it is analytically convenient. The normal distribution is one of the few distributions for which the KL divergence can be computed in closed form, so choosing it here is a natural decision. Furthermore, using a mean of 0 and a variance of I makes the calculation even simpler.

Third, it provides regularization for the latent space. By assuming a standard normal distribution as the prior for the latent variable z, the VAE structures the latent representations across the entire latent space, regularizing the samples of latent representations to follow the standard normal distribution. Consequently, when used as a generative AI, it becomes easier to guarantee that meaningful natural images are reconstructed by using latent representations sampled from the standard normal distribution.

Analytically Solving the KL Divergence

Next, let's solve the KL divergence of the second term of the ELBO analytically. It is known that the KL divergence between two normal distributions is expressed by the following formula (I hope you can accept this as a given):

\mathrm{KL}(q \parallel p) = \frac{1}{2} \left( \mathrm{tr}(\Sigma_p^{-1} \Sigma_q) + (\mu_p - \mu_q)^T \Sigma_p^{-1} (\mu_p - \mu_q) - k + \log \frac{\det \Sigma_p}{\det \Sigma_q} \right)

Where the two normal distributions, of dimensionality k, are assumed to be:

\mathcal{N}(\mu_p, \Sigma_p), \mathcal{N}(\mu_q, \Sigma_q)

Applying this to our specific KL divergence, we get:

\mathrm{KL}(q_{\psi}(z_i|x_i) \parallel p(z_i)) = \frac{1}{2} \sum_{j=1}^D \left( \sigma_j^2 + \mu_j^2 - 1 - \log \sigma_j^2 \right)

Here, D is the dimensionality of the latent representation z_i. From the formula above, it's clear that the second term can be calculated using the mean and variance parameters output by the encoder.

The Final Expression

In the end, the ELBO reduces to the following expression:

\mathrm{ELBO} \approx - \frac{1}{2} \| x_i - \hat{x_i} \|^2 - \frac{1}{2} \sum_{j=1}^D \left( \sigma_j^2 + \mu_j^2 - 1 - \log \sigma_j^2 \right) + \mathrm{const}
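As a sketch of this final expression (not the author's implementation; the numbers and log-variance parameterization are assumptions for illustration), the per-sample loss to minimize, i.e. the negative ELBO with constants dropped, can be written as:

```python
import math

def negative_elbo(x, x_hat, mu, log_var):
    """Per-sample VAE loss: squared-error reconstruction + closed-form KL.

    x, x_hat   : ground-truth and reconstructed data (lists of floats)
    mu, log_var: encoder outputs per latent dimension (log-variance
                 parameterization, a common convention)
    """
    recon = 0.5 * sum((a - b) ** 2 for a, b in zip(x, x_hat))
    kl = 0.5 * sum(math.exp(lv) + m**2 - 1 - lv
                   for m, lv in zip(mu, log_var))
    return recon + kl

# Made-up numbers for illustration.
loss = negative_elbo([0.0, 1.0], [0.1, 0.9], mu=[0.2, -0.1], log_var=[0.0, 0.0])
print(loss)  # reconstruction 0.01 + KL 0.025 = 0.035
```

When mu is all zeros and log_var is all zeros, the KL term vanishes, matching the fact that the divergence is 0 when q equals the standard normal prior.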

From VAE to Diffusion Models

Well done so far.
By now, I believe you have understood everything from maximizing the log-likelihood to maximizing the ELBO, which are the objective functions of VAEs.

However, the discussion so far has focused on VAEs.
From here on, I would like to consider diffusion models.

But diffusion models are actually simpler.
This is because there are no learnable parameters in the Encoder.

Let me restate the log-likelihood transformation for VAEs:

\log p_{\theta}(x_i) = \mathbb{E}_{q_{\psi}(z_i|x_i)}[\log p_{\theta}(x_i|z_i)] - \mathrm{KL}\left(q_{\psi}(z_i|x_i) \parallel p(z_i)\right) + \mathrm{KL}\left(q_{\psi}(z_i|x_i) \parallel p_{\theta}(z_i|x_i)\right)

The structure of a diffusion model is basically very similar to what is called a multi-layered VAE.
The difference is that there are no parameters in the Encoder, and the next stage of latent representation is obtained by repeatedly performing the same process on the current latent representation.
(For more details, please see this article.)

In other words, you can think of it as a VAE log-likelihood transformation without the Encoder parameters.

And if there are no parameters, they can be ignored in the optimization problem.

I will omit the detailed explanation as I am exhausted, but finally, only the squared error term between the generated image and the ground truth image remains. Therefore, in diffusion models, the objective function becomes simply the minimization of the squared error between the generated image and the ground truth image.

It's quite straightforward.

Summary

Thank you for reading this far!

Understanding these discussions will help you understand the objective functions of generative AI systems, which I believe will make papers easier to read.
I hope this is helpful to everyone!
