Recent developments in machine learning have led to outstanding results in generative modeling. Deep neural networks have been successfully exploited to generate realistic content such as text, video, music, and images, as well as to transform content from one genre to another (X-to-Y generative models). Among generative models, architectures such as Generative Adversarial Networks (GANs) and Diffusion Models (DMs) have been particularly successful.
GANs have the benefit of providing a clear statistical objective and are free from the normalization issues that affect likelihood-based models. However, training GANs remains difficult, and they often suffer from a phenomenon known as mode collapse, in which the model generates only a limited number of modes. Currently, GAN-based image synthesis models are still mostly restricted to low-resolution images.
On the other hand, diffusion models (DMs) can generate high-quality images through iterative denoising starting from random noise. These models are usually guided by an input text prompt that describes the desired image. The prompt is used mainly in the early sampling stages, when the input to the denoising network is still close to random noise. In the later stages, as generation continues, the model gradually shifts toward visual features to denoise the image, largely ignoring the text prompt. This behavior can lead to inconsistencies between the fine-grained details in the text and the output image. Motivated by this observation, the presented work, termed eDiff-I, aims to increase the capacity of diffusion models by training an ensemble of expert denoisers, each specialized for a particular stage of the generation process.
In most existing works on diffusion models, the denoising model is shared across all stages. However, a single shared model with limited capacity may struggle to learn the complex temporal dynamics of denoising diffusion from data. Therefore, eDiff-I scales up the capacity of the denoising model by introducing an ensemble of expert denoisers, each specialized for a particular range of noise levels. To implement this architecture without adding extra computational cost compared to standard diffusion models, a shared denoiser is first trained across all noise levels and then fine-tuned into an expert for each denoising stage.
In the end, the system comprises three expert denoisers: one specialized for high noise levels, one for intermediate levels, and one for low noise levels.
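The idea of routing each denoising step to the expert responsible for the current noise level can be sketched as follows. This is an illustrative toy, not the authors' code: the `Denoiser` class, the noise thresholds, and the tensor shapes are all assumptions made for the example (the paper derives its noise-level partition from a binary-splitting scheme rather than fixed cutoffs).

```python
import torch
import torch.nn as nn

class Denoiser(nn.Module):
    """Toy stand-in for a text-conditioned denoising network."""
    def __init__(self):
        super().__init__()
        self.net = nn.Conv2d(3, 3, kernel_size=3, padding=1)

    def forward(self, x, sigma, text_emb):
        # a real model would also condition on sigma and text_emb
        return self.net(x)

# One expert per noise range; in eDiff-I these start from a shared
# model and are then fine-tuned on their own interval of noise levels.
experts = {
    "high": Denoiser(),  # early steps: very noisy input, text matters most
    "mid":  Denoiser(),
    "low":  Denoiser(),  # late steps: mostly visual refinement
}

def select_expert(sigma: float) -> Denoiser:
    # Hypothetical thresholds chosen only for illustration.
    if sigma > 1.0:
        return experts["high"]
    if sigma > 0.1:
        return experts["mid"]
    return experts["low"]

x = torch.randn(1, 3, 64, 64)       # start from pure noise
text_emb = torch.randn(1, 77, 512)  # placeholder text embedding
for sigma in [2.0, 0.5, 0.05]:      # decreasing noise schedule
    expert = select_expert(sigma)
    x = expert(x, sigma, text_emb)  # one (heavily simplified) denoising step
```

Because every expert shares the same interface and only one expert runs per step, sampling costs the same as with a single shared denoiser; only the number of stored parameters grows.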
The above picture depicts the system pipeline. At the end of the base diffusion process, eDiff-I generates images at 64×64 resolution. Two super-resolution diffusion models then upsample the result to 256×256 and 1024×1024 resolution, respectively. All models are conditioned on text embeddings provided by T5 and CLIP, two popular text encoders.
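The cascade structure can be sketched as below. This is a minimal sketch under stated assumptions: `base_generate` and `super_resolve` are placeholders standing in for the actual diffusion samplers, and the embedding shape is invented; the only point illustrated is the 64 → 256 → 1024 chain with shared text conditioning.

```python
import torch
import torch.nn.functional as F

def base_generate(text_emb):
    # stand-in for the 64x64 base diffusion sampler
    return torch.randn(1, 3, 64, 64)

def super_resolve(image, text_emb, out_size):
    # stand-in for a text-conditioned super-resolution diffusion model;
    # the real models denoise conditioned on the low-resolution input,
    # here we just resize to show the data flow
    return F.interpolate(image, size=(out_size, out_size), mode="bilinear",
                         align_corners=False)

text_emb = torch.randn(1, 77, 512)   # placeholder for T5/CLIP embeddings
img64 = base_generate(text_emb)      # base stage
img256 = super_resolve(img64, text_emb, 256)    # first SR stage
img1024 = super_resolve(img256, text_emb, 1024) # second SR stage
```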
In addition to plain text prompts, the authors propose a method named paint-with-words that lets users specify the spatial locations of objects by doodling on a canvas. The doodle is converted into a binary mask and fed into all cross-attention layers.
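One way such a mask can steer generation is by adding it as a bias to the cross-attention scores, so that painted image locations attend more strongly to the word they were painted with. The sketch below illustrates this additive-bias idea; the function name, shapes, and the `weight` parameter are assumptions for the example, not the exact eDiff-I formulation.

```python
import torch
import torch.nn.functional as F

def cross_attention_with_mask(q, k, v, mask, weight=1.0):
    """
    q:    (pixels, d)      image queries
    k, v: (tokens, d)      text keys/values
    mask: (pixels, tokens) binary map; mask[p, t] = 1 means the user
          painted pixel p with word t on the canvas.
    """
    d = q.shape[-1]
    scores = q @ k.T / d**0.5
    scores = scores + weight * mask   # bias painted (pixel, word) pairs
    attn = F.softmax(scores, dim=-1)  # attention over text tokens
    return attn @ v

pixels, tokens, d = 16, 4, 8
q = torch.randn(pixels, d)
k = torch.randn(tokens, d)
v = torch.randn(tokens, d)
mask = torch.zeros(pixels, tokens)
mask[:8, 1] = 1.0  # "paint" the top half of the image with token 1
out = cross_attention_with_mask(q, k, v, mask, weight=5.0)
```

With a large `weight`, the softmax for painted pixels concentrates almost entirely on the painted token, which is what pushes the corresponding object toward the doodled region.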
The figure above shows the improvement eDiff-I provides over the baseline approach. The images generated by the proposed model not only look realistic but also contain all the details specified in the text description, unlike those of the baseline method.
As stated by the authors, the main contributions of this work are threefold: (i) they observe that the behavior of the generative model changes depending on the denoising stage; (ii) they propose an ensemble of expert denoisers to better guide the generation from the input text prompt; and (iii) they introduce a method called paint-with-words that enables users to specify the spatial locations of objects by doodling on a canvas.
This was a summary of eDiff-I, a novel text-to-image diffusion model designed to guarantee strong alignment between the text prompt and the generated image. You can find more information in the links below if you want to learn more about it.
Check out the paper and project. All credit for this research goes to the researchers on this project. Also, don’t forget to join our Reddit page and Discord channel, where we share the latest AI research news, cool AI projects, and more.
Daniele Lorenzi received his M.Sc. in ICT for Internet and Multimedia Engineering in 2021 from the University of Padua, Italy. He is a Ph.D. candidate at the Institute of Information Technology (ITEC) at the Alpen-Adria-Universität (AAU) Klagenfurt. He is currently working in the Christian Doppler Laboratory ATHENA and his research interests include adaptive video streaming, immersive media, machine learning, and QoS/QoE evaluation.