Compression for better image generation

My NeurIPS 2020 paper.

Wow, what a year! I just published my first paper, titled “A Loss Function for Generative Neural Networks Based on Watson’s Perceptual Model” at NeurIPS 2020.

The problem

To train a Variational Autoencoder (VAE), a form of generative model, for image generation, one needs to define a function that measures the distance between two images. Simple choices include the pixel-wise mean squared error (MSE) or the patch-wise SSIM, but both of these metrics are known to perform quite poorly on images.
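To make the role of this distance concrete, here is a minimal sketch (PyTorch, with illustrative names, not the training code from the paper) of how the image distance plugs into the standard VAE objective next to the KL regularizer:

```python
import torch
import torch.nn.functional as F

def vae_loss(x, x_hat, mu, logvar, distance=F.mse_loss):
    # Reconstruction term: any image distance function can be plugged in here,
    # e.g. MSE, SSIM, or a perceptual metric.
    reconstruction = distance(x_hat, x)
    # KL divergence between the approximate posterior N(mu, diag(exp(logvar)))
    # and the standard normal prior N(0, I).
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return reconstruction + kl
```

Whatever notion of "close" this distance encodes is exactly what the generator learns to optimize.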

Imagine an image shifted by just a few pixels. Perceptually, it is almost the same image, so the distance should be small. However, for traditional distance measures, the distance can get quite large even for perceptually similar images.
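A toy calculation illustrates this (NumPy, with a synthetic striped image, so the exact number is only indicative): shifting the image by three pixels leaves it perceptually unchanged, yet the pixel-wise MSE is close to its maximum possible value.

```python
import numpy as np

# Synthetic image with sharp edges: vertical stripes, 4 pixels wide.
x = np.indices((64, 64))[1]
img = ((x // 4) % 2).astype(float)       # values in {0, 1}

# The same image shifted 3 pixels to the right: perceptually the same stripes.
shifted = np.roll(img, shift=3, axis=1)

mse = np.mean((img - shifted) ** 2)
print(f"pixel-wise MSE: {mse:.2f}")      # 0.75, close to the maximum of 1.0,
                                         # because the edges no longer line up
```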

An emerging line of work that tackles this problem are perceptual similarity metrics, which try to model human perception of visual data more accurately. These are typically based on deep features extracted from image classification networks. However, this approach has two downsides (a rough sketch of such a metric follows the list):

  • The perceptual models used are often excessively large, limiting the GPU memory available for the generative model.
  • The perceptual models overfit to the data they are trained on and cannot be adapted to other domains.
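For reference, a hedged sketch of this deep-feature approach (using torchvision's VGG16; the layer choice and weighting are illustrative, and this is not any specific metric compared in the paper) looks roughly like this:

```python
import torch
import torchvision

# A pretrained classification network serves as a fixed feature extractor.
# Its tens of millions of parameters occupy GPU memory next to the generative
# model, and its features reflect the data it was trained on (ImageNet).
vgg = torchvision.models.vgg16(pretrained=True).features[:16].eval()
for p in vgg.parameters():
    p.requires_grad_(False)

def deep_feature_distance(x, y):
    # x, y: batches of RGB images with shape (N, 3, H, W)
    return torch.mean((vgg(x) - vgg(y)) ** 2)
```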

The solution

We design a perceptual similarity metric based on ideas from image compression. In the 1990s, researchers were trying to quantify the perceptual impact of image components in order to design better lossy image compression algorithms, work that ultimately culminated in the JPEG compression standard widely used today. The perceptual models used in compression are lightweight, applicable to a large variety of image domains, and well proven in practice.
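To give a flavour of the idea (a strongly simplified sketch, not the Watson-DFT metric from the paper): compression-style perceptual models compare images in a frequency domain, block by block, and weight coefficient errors by how sensitive human vision is to each frequency. The sensitivity weights below are hypothetical; real models such as Watson's use measured tables and additionally account for luminance and contrast masking.

```python
import numpy as np
from scipy.fft import dctn

BLOCK = 8
# Hypothetical sensitivity weights: errors in low frequencies (top-left of each
# block) are weighted more heavily than errors in high frequencies.
i, j = np.indices((BLOCK, BLOCK))
weights = 1.0 / (1.0 + i + j)

def blockwise_frequency_distance(img_a, img_b):
    """Weighted squared distance between DCT coefficients of 8x8 blocks."""
    h, w = img_a.shape
    total = 0.0
    for y in range(0, h - h % BLOCK, BLOCK):
        for x in range(0, w - w % BLOCK, BLOCK):
            ca = dctn(img_a[y:y + BLOCK, x:x + BLOCK], norm="ortho")
            cb = dctn(img_b[y:y + BLOCK, x:x + BLOCK], norm="ortho")
            total += np.sum((weights * (ca - cb)) ** 2)
    return total
```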

Results

We find that generative models trained with our perceptual metric, Watson-DFT, produce higher-quality images than models trained with traditional metrics like MSE or SSIM, while not suffering from the domain-adaptation problems of deeper perceptual models.

Perceptual similarity metrics can drastically improve the quality of generated images: the SSIM baseline produces blurry samples, while images generated with our Watson-DFT and the deep-learning-based Deeploss-VGG are of higher quality.
Our Watson-DFT metric also generalizes better to new domains: models trained with the deep-learning-based Deeploss-VGG and Deeploss-Squeeze fail to generate MNIST digits.

Video presentation

More quantitative results, and much more, can be found in the recorded talk:

The source code is available on my GitHub.