Diffusion Models

Last updated: Jun 24, 2026

Author: Vinay Adari

Diffusion Models are the technology behind most of today's state-of-the-art image generators. They create data using a surprising idea borrowed from physics: take an image, slowly destroy it by adding noise, and then train a neural network to reverse that process — turning pure random noise back into a clean, detailed image. To generate something new, you simply start from noise and let the network denoise its way to a brand-new result.

💡 In one line: A diffusion model learns to turn random noise into data by reversing a step-by-step noising process.

The Core Idea: Add Noise, Then Learn to Remove It

A diffusion model is built around two processes that mirror each other:

Forward (diffusion) process — start with a real image and add a tiny bit of random (Gaussian) noise, over and over, across many steps, until the image becomes pure noise. This process is fixed — there's nothing to learn.
Reverse (denoising) process — a neural network learns to undo the noising, one small step at a time, recovering a clean image from noise. This is the part that's learned.

Once trained, generating a new image is easy: start from pure noise and run the reverse process to gradually refine it into a clean, original image.
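
The forward process described above can be sketched in a few lines. This is a toy illustration using only the Python standard library, not a reference implementation: the linear noise schedule, the choice of 1000 steps, and the names `make_betas` and `forward_step` are assumptions made for the example.

```python
import math
import random

def make_betas(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Assumed linear schedule: the noise variance added at each step."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]

def forward_step(x, beta, rng):
    """One noising step: shrink the signal slightly, then add Gaussian noise."""
    keep = math.sqrt(1.0 - beta)
    spread = math.sqrt(beta)
    return [keep * v + spread * rng.gauss(0.0, 1.0) for v in x]

rng = random.Random(0)
betas = make_betas()
x = [1.0] * 512          # toy "image": 512 pixels, all equal to 1.0
signal_left = 1.0        # fraction of the original pixel value still present

for beta in betas:
    x = forward_step(x, beta, rng)
    signal_left *= math.sqrt(1.0 - beta)

mean = sum(x) / len(x)
variance = sum((v - mean) ** 2 for v in x) / len(x)
print(f"signal left: {signal_left:.4f}, mean: {mean:.3f}, variance: {variance:.3f}")
```

Under this assumed schedule almost none of the original signal survives the last step, and the pixels end up looking like samples from a standard Gaussian, which is what "pure noise" means here.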

How It Works

Forward: A clean image x₀ is noised step by step → x₁ → x₂ → … → x_T, where T is the total number of steps, until x_T is essentially pure noise.
Training: A training image is noised to a randomly chosen step t, and the network learns to predict the noise that was added. If it can predict the noise, it can subtract it.
Reverse (generation): Start from random noise x_T. Repeatedly ask the network "what noise is in this?" and remove a little of it, step by step, until you arrive at a clean new image x₀.
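
The training step can be made concrete with a small sketch. It assumes the common DDPM-style formulation, in which an image can be noised to any step directly using the cumulative product of the schedule, and the loss is the mean squared error between predicted and true noise. The schedule values and the helper names (`cumulative_signal`, `make_training_example`, `noise_prediction_loss`) are hypothetical, and a constant placeholder stands in for the neural network.

```python
import math
import random

def cumulative_signal(betas):
    """abar[t]: product of (1 - beta) up to step t, the signal variance left."""
    out, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        out.append(running)
    return out

def make_training_example(x0, abar, rng):
    """Pick a random step, noise x0 to that level, return (x_t, t, noise)."""
    t = rng.randrange(len(abar))
    noise = [rng.gauss(0.0, 1.0) for _ in x0]
    a, b = math.sqrt(abar[t]), math.sqrt(1.0 - abar[t])
    x_t = [a * v + b * n for v, n in zip(x0, noise)]
    return x_t, t, noise

def noise_prediction_loss(predicted, target):
    """Mean squared error between predicted and true noise."""
    return sum((p - q) ** 2 for p, q in zip(predicted, target)) / len(target)

rng = random.Random(0)
betas = [1e-4 + i * (0.02 - 1e-4) / 999 for i in range(1000)]  # assumed schedule
abar = cumulative_signal(betas)
x0 = [0.5, -0.25, 1.0, 0.0]                                   # toy 4-pixel image
x_t, t, noise = make_training_example(x0, abar, rng)

# Placeholder "network" that always predicts zero noise; a real model would be
# a trained neural network that takes (x_t, t) as input.
predicted = [0.0] * len(x_t)
loss = noise_prediction_loss(predicted, noise)

# With a perfect noise prediction, the clean image is recovered exactly.
a, b = math.sqrt(abar[t]), math.sqrt(1.0 - abar[t])
recovered = [(v - b * n) / a for v, n in zip(x_t, noise)]
```

The last two lines illustrate the "predict it, then subtract it" idea: knowing the noise and the noise level is enough to solve for the original image.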

The key trick is that the model never has to draw an image from scratch — it only has to remove a little noise at a time, which is a far easier task.
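
Generation can be sketched as a loop that removes a little noise per step. The example below assumes DDPM-style ancestral sampling with the same hypothetical linear schedule. To stay runnable without a trained model, it uses `toy_noise_predictor`, a stand-in that is only valid for an imaginary data set containing a single image, `TARGET`. A real system would call a trained network at that point.

```python
import math
import random

TARGET = [0.5, -0.25, 1.0, 0.0]   # the only "image" in this toy data set

betas = [1e-4 + i * (0.02 - 1e-4) / 999 for i in range(1000)]  # assumed schedule
abar, running = [], 1.0
for beta in betas:
    running *= 1.0 - beta
    abar.append(running)          # cumulative signal variance left at each step

def toy_noise_predictor(x_t, t):
    """Stand-in for a trained network: exact noise when all data equals TARGET."""
    a, b = math.sqrt(abar[t]), math.sqrt(1.0 - abar[t])
    return [(v - a * c) / b for v, c in zip(x_t, TARGET)]

def sample(rng):
    """Start from pure noise and denoise one step at a time."""
    x = [rng.gauss(0.0, 1.0) for _ in TARGET]
    for t in reversed(range(len(betas))):
        eps = toy_noise_predictor(x, t)
        coef = betas[t] / math.sqrt(1.0 - abar[t])
        # Remove a small part of the predicted noise and rescale.
        mean = [(v - coef * e) / math.sqrt(1.0 - betas[t]) for v, e in zip(x, eps)]
        if t > 0:
            # Every step except the last adds back a little fresh noise.
            sigma = math.sqrt(betas[t])
            x = [m + sigma * rng.gauss(0.0, 1.0) for m in mean]
        else:
            x = mean
    return x

generated = sample(random.Random(0))
print([round(v, 3) for v in generated])
```

Because the toy data set has one image, every run ends at that image. With a network trained on many images, the same loop would produce a different plausible image for each random starting point.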

The Denoising Network

The network that predicts the noise is often a U-Net, an encoder-decoder architecture with skip connections that is well suited to images, although transformer-based denoisers are also used. It is also told which timestep it is working on, so it knows how much noise to expect. By training on a large set of images at many different noise levels, it learns to estimate the noise at any stage.
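
One common way to tell the network which timestep it is on is a sinusoidal embedding, similar to positional encodings in transformers. The text above does not say which method is used, so the sketch below is an assumption, and the function name `timestep_embedding` and its parameters are illustrative.

```python
import math

def timestep_embedding(t, dim=8, max_period=10000.0):
    """Encode an integer timestep as sines and cosines at spaced frequencies."""
    half = dim // 2
    # Frequencies fall geometrically from 1 toward 1 / max_period.
    freqs = [math.exp(-math.log(max_period) * i / half) for i in range(half)]
    return [math.sin(t * f) for f in freqs] + [math.cos(t * f) for f in freqs]

early = timestep_embedding(10)    # lightly noised input
late = timestep_embedding(900)    # heavily noised input
print([round(v, 3) for v in early])
print([round(v, 3) for v in late])
```

Each timestep maps to a distinct vector with values between -1 and 1, which a network can consume more easily than a raw integer.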

Text-to-Image Diffusion

To turn a text prompt into an image, the denoising network is conditioned on the text (typically via a mechanism called cross-attention). At each denoising step, the model is nudged toward an image that matches the prompt — so "a cat wearing a hat" steers the noise-removal toward exactly that.
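
A stripped-down view of cross-attention may help: each image position forms a query, compares it against every text token, and pulls in a weighted mix of the text information. The sketch below is single-head and omits the learned projection matrices a real model has. The two-dimensional token vectors are made up for illustration.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross_attention(image_tokens, text_tokens):
    """Each image token (query) takes a weighted average of the text tokens."""
    width = len(text_tokens[0])
    scale = 1.0 / math.sqrt(width)
    outputs, all_weights = [], []
    for query in image_tokens:
        weights = softmax([dot(query, key) * scale for key in text_tokens])
        mixed = [sum(w * tok[i] for w, tok in zip(weights, text_tokens))
                 for i in range(width)]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights

# Hypothetical embeddings: one vector per prompt word, one per image region.
text_tokens = [[4.0, 0.0], [0.0, 4.0]]    # stand-ins for "cat" and "hat"
image_tokens = [[3.0, 0.0], [0.0, 3.0]]   # a "body" region and a "head" region
outputs, weights = cross_attention(image_tokens, text_tokens)
```

In this toy setup the first region attends mostly to the first word and the second region to the second word, which is the sense in which the prompt steers different parts of the image.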

Modern systems also use latent diffusion: instead of working on full-size pixels, they run the diffusion process in a compressed latent space produced by an autoencoder, then decode the final latent back to pixels. This makes generation far faster and cheaper. Stable Diffusion, a widely used open image generator, works this way.
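
A quick calculation shows why the latent space is cheaper. The figures below are assumptions chosen for illustration (a 512×512 RGB image, an autoencoder that shrinks each side by a factor of 8, and 4 latent channels), and actual models vary.

```python
# Assumed sizes, for illustration only.
image_height, image_width, image_channels = 512, 512, 3
downsample_factor = 8
latent_channels = 4

pixel_values = image_height * image_width * image_channels
latent_height = image_height // downsample_factor
latent_width = image_width // downsample_factor
latent_values = latent_height * latent_width * latent_channels

reduction = pixel_values / latent_values
print(f"pixel space:  {pixel_values} values")
print(f"latent space: {latent_values} values ({latent_height}x{latent_width}x{latent_channels})")
print(f"reduction:    {reduction:.0f}x fewer values per denoising step")
```

Since the denoising network runs once per step, shrinking its input by this much reduces the cost of every one of those steps.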

Diffusion vs. GAN vs. VAE

Aspect	Diffusion	GAN	VAE
Output quality	Very high	High	Moderate (blurry)
Training	Stable	Unstable	Stable
Generation speed	Slow (many steps)	Fast (one pass)	Fast
Diversity	Excellent	Risk of mode collapse	Good
Main use	Text-to-image, top quality	Realistic images	Compression, sampling

Pros and Cons of Diffusion Models

✅ Pros (Advantages)	⚠️ Cons (Challenges)
State-of-the-art image quality	Slow to generate (many denoising steps)
Stable training (no adversarial game)	Computationally expensive
Highly diverse outputs (no mode collapse)	Complex underlying maths
Flexible conditioning (text, images)	Heavy compute for high resolution
Great for text-to-image	Slower than GANs at inference

Applications of Diffusion Models

Domain	Use
Image generation	Text-to-image art and photos
Image editing	Inpainting, outpainting, super-resolution
Audio	Music and speech generation
Video	Generating and editing short clips
Science	Molecule and protein generation

Summary

A diffusion model generates data by reversing a noising process — learning to turn random noise into clean data.
The forward process adds noise step by step; the reverse process (a learned network) removes it.
To generate, it starts from noise and denoises repeatedly until a clean image appears.
With text conditioning and latent diffusion, it powers modern text-to-image generators.
Diffusion models typically offer very high quality and excellent diversity, but are slower to generate than GANs.