You can use/combine multiple metrics to improve the image generation quality (source: Loss Functions for Neural Networks for Image Processing):
- MS-SSIM preserves the contrast in high-frequency regions
- L1 preserves colors and luminance
- LPIPS compares deep network features, which captures perceptual, scene-level similarity
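
As a rough illustration of combining such metrics, here is a minimal NumPy sketch that mixes an L1 term with a structural-similarity term. It makes several assumptions that are not from the source: the SSIM here is a simplified single-scale version computed from global image statistics (not the windowed, multi-scale MS-SSIM), the function names are hypothetical, and the weight `alpha=0.84` is only an illustrative default. LPIPS is left out because it needs a pretrained network.

```python
import numpy as np

def l1_loss(pred, target):
    """Mean absolute error between two images."""
    return float(np.mean(np.abs(pred - target)))

def global_ssim(pred, target, data_range=1.0):
    """Simplified SSIM from global statistics (assumption: no sliding
    window and no multiple scales, unlike real MS-SSIM)."""
    c1 = (0.01 * data_range) ** 2  # usual SSIM stabilizing constants
    c2 = (0.03 * data_range) ** 2
    mu_p, mu_t = pred.mean(), target.mean()
    var_p, var_t = pred.var(), target.var()
    cov = np.mean((pred - mu_p) * (target - mu_t))
    num = (2 * mu_p * mu_t + c1) * (2 * cov + c2)
    den = (mu_p ** 2 + mu_t ** 2 + c1) * (var_p + var_t + c2)
    return float(num / den)

def combined_loss(pred, target, alpha=0.84):
    """Weighted mix of a structural term and an L1 term.
    alpha is a hypothetical weight chosen for illustration."""
    structural = 1.0 - global_ssim(pred, target)
    return alpha * structural + (1.0 - alpha) * l1_loss(pred, target)

rng = np.random.default_rng(0)
image = rng.uniform(0.0, 1.0, size=(32, 32))
inverted = 1.0 - image

loss_same = combined_loss(image, image)
loss_diff = combined_loss(image, inverted)
```

In practice one would swap `global_ssim` for a proper MS-SSIM implementation from an image-quality library; the point here is only the shape of the weighted sum.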
In generative modeling, especially in vision, it is a well-known observation that using the $L_2$ loss yields blurry images.
We look at the foundation first. The $L_2$ loss is defined as:
$$L = \frac{1}{2M} \sum_i^M \sum_j^N (\hat{x}_{ij} - x_{ij})^2$$
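where $M$ is the number of data points, $N$ is the dimension of each data point, $\hat{x}$ is the model's output, and $x$ is the original data.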
Now, we will try to interpret that in a probabilistic setting.
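Let's take a look at the Gaussian distribution. It is defined as follows:
$$p(x \mid \mu, \sigma^2) = \frac{1}{Z} \exp \left( -\frac{\Vert \mu - x \Vert_2^2}{2 \sigma^2} \right)$$
where $\mu$ is the mean, $\sigma^2$ is the variance, and $Z$ is the normalizing constant, which does not depend on $\mu$.
If we set $\mu = \hat{x}$ and $\sigma^2 = 1$, this becomes
$$p(x \mid \hat{x}) = \frac{1}{Z} \exp \left( -\frac{1}{2} \Vert \hat{x} - x \Vert_2^2 \right)$$
We could further apply the logarithm and average over the $M$ data points:
$$\frac{1}{M} \sum_i^M \log p(x_i \mid \hat{x}_i) = -\frac{1}{2M} \sum_i^M \sum_j^N (\hat{x}_{ij} - x_{ij})^2 - \log Z$$
Up to the sign and the constant $\log Z$, we get the same equation as the loss $L$ above.
In conclusion, minimizing $L$ is equivalent to maximizing the log-likelihood of a Gaussian.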
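
The link between the squared-error loss and a Gaussian can be checked numerically. The sketch below assumes a Gaussian with mean $\hat{x}$ and identity covariance (an assumption of this example), uses random arrays as stand-ins for data and model outputs, and verifies that the average negative log-likelihood differs from $L$ only by a constant.

```python
import numpy as np

rng = np.random.default_rng(0)
M, N = 8, 5                      # M data points, N dimensions each
x = rng.normal(size=(M, N))      # stand-in for the original data
x_hat = rng.normal(size=(M, N))  # stand-in for the model outputs

# Squared-error loss: 1/(2M) * sum_i sum_j (x_hat_ij - x_ij)^2
l2 = np.sum((x_hat - x) ** 2) / (2 * M)

# Average negative log-likelihood under N(x_hat, I).
# log Z = (N/2) * log(2*pi) for an N-dimensional unit-variance Gaussian.
log_Z = 0.5 * N * np.log(2 * np.pi)
per_point_nll = 0.5 * np.sum((x - x_hat) ** 2, axis=1) + log_Z
nll = np.mean(per_point_nll)

# The gap between the two is the constant log Z, independent of x_hat,
# so both objectives share the same minimizer.
gap = nll - l2
```

Because `gap` does not depend on `x_hat`, any optimizer that lowers one quantity lowers the other by exactly the same amount.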
Now we consider the nature of the real distribution of our data. Suppose we want to search for images of "car". Entering this query into Google Images would yield many images of cars, and those images differ from each other in color, shape, and so on. Therefore we know that there are multiple ways to generate an image from the single word "car", all equally valid (in the sense that, looking at any of them, we would immediately think "car"). As there are multiple possible ways of generating a "car" image, we can think of the distribution of "car" images as having multiple peaks. In other words, we say that the distribution of images is multimodal.
Here is the problem. In our loss function above, we assume that a particular image comes from a Gaussian. This is a unimodal distribution, meaning that it has only a single peak. What would happen if we fit a unimodal distribution to a multimodal one using this loss?
Let's simplify our multimodal distribution into a bimodal distribution (two modes/peaks), and fit a Gaussian to approximate it. We would get something like this:
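During the optimization process, we present two types of samples, originating from the left and right modes. As the Gaussian has only a single peak and the loss penalizes the squared distance to samples from both modes, the best it can do is place its mean in between them.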
The implication is that if we sample from our Gaussian, the sample would come from the middle of the two modes in image space, even though that region in reality has very low probability. Therefore, our sample would be the average of samples that come from those two modes, hence we get a blurry image.
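
The averaging effect is easy to reproduce in one dimension. The following sketch uses made-up bimodal data (two modes at -3 and +3, both hypothetical) and searches for the single constant prediction that minimizes the mean squared error. The minimizer lands at the data mean, between the modes, where almost no data actually lives.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical bimodal data: two well-separated modes
left = rng.normal(loc=-3.0, scale=0.5, size=5000)
right = rng.normal(loc=3.0, scale=0.5, size=5000)
data = np.concatenate([left, right])

# Grid search for the constant c minimizing mean((data - c)^2)
candidates = np.linspace(-5.0, 5.0, 1001)
mse = np.array([np.mean((data - c) ** 2) for c in candidates])
best = candidates[np.argmin(mse)]

# Fraction of real samples lying close to the fitted value vs. close to a mode
near_best = np.mean(np.abs(data - best) < 0.5)
near_mode = np.mean(np.abs(data - 3.0) < 0.5)

print(f"MSE-optimal prediction: {best:.2f}, data mean: {data.mean():.2f}")
print(f"fraction of data near the prediction: {near_best:.4f}")
print(f"fraction of data near the right mode: {near_mode:.4f}")
```

The same thing happens per pixel in image space: the squared-error optimum is an average over plausible outputs, not one of them.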
To make this idea more concrete, suppose those two modes represent "sedan" and "SUV". If we sample from our Gaussian, what we get is somewhere between those two types of car. Now, imagine this in a high-dimensional space, with many more modes for all the properties of "car". Surely, mixing many properties of different types of cars will not make an image look realistic to us; however, in terms of the loss above, that mixture is the best solution.