compvis/stable-diffusion

A latent text-to-image diffusion model

72,910 stars · Jupyter Notebook · 6 components

Generates images from text prompts using latent diffusion in a compressed autoencoder space

Under the hood, the system uses 3 feedback loops, 3 data pools, and 5 control points to manage its runtime behavior.

A 6-component ML inference system. 43 files analyzed. Data flows through 7 distinct pipeline stages.

How Data Flows Through the System

Images and text enter the system during training where images are encoded to latent space, noise is added according to a schedule, and a UNet learns to predict this noise conditioned on CLIP text embeddings. During inference, the process reverses: text is encoded to embeddings, random noise is iteratively denoised by the trained UNet guided by these embeddings, and the resulting clean latents are decoded back to pixel images.

  1. Encode images to latent space — AutoencoderKL.encode() compresses RGB images from 512x512x3 to 64x64x4 latent representations, applying VAE encoding with KL divergence regularization to ensure smooth latent distributions suitable for diffusion [RGB pixel images → LatentTensor] (config: model.params.embed_dim, model.params.ddconfig.z_channels)
  2. Encode text prompts — Frozen CLIP ViT-L/14 text encoder processes input text prompts into 768-dimensional token embeddings with padding/truncation to fixed sequence length for consistent conditioning [Text prompts → TextEmbedding] (config: model.params.cond_stage_key)
  3. Add scheduled noise — Forward diffusion process adds Gaussian noise to clean latents according to beta schedule — the noise amount depends on randomly sampled timestep t, with more noise at higher timesteps [LatentTensor → Noisy latents] (config: model.params.linear_start, model.params.linear_end, model.params.timesteps)
  4. Predict noise with UNet — UNetModel takes noisy latents, timestep embeddings, and text embeddings as input — uses ResNet blocks with cross-attention to predict the noise that was added, guided by text conditioning through SpatialTransformer layers [DiffusionBatch → Predicted noise] (config: model.params.unet_config.model_channels, model.params.unet_config.attention_resolutions)
  5. Compute training loss — L2 loss between predicted noise and actual noise added during forward process — LatentDiffusion.training_step() computes this denoising objective with optional classifier-free guidance weighting [Predicted noise → Training loss] (config: model.params.loss_type)
  6. Sample via reverse diffusion — DDIMSampler.sample() starts with pure noise and iteratively removes predicted noise over decreasing timesteps — uses deterministic DDIM sampling for faster generation than stochastic DDPM [Random noise → Clean LatentTensor]
  7. Decode latents to pixels — AutoencoderKL.decode() upsamples clean 64x64x4 latents back to 512x512x3 RGB images using the decoder network trained to reconstruct pixel-space images from latent codes [LatentTensor → Generated images] (config: model.params.ddconfig.resolution)
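
To make the inference half of this pipeline concrete (steps 2, 6, and 7), here is a minimal sketch of a classifier-free-guided DDIM loop in plain PyTorch. The `text_encoder`, `unet`, and `vae_decoder` callables and their signatures are stand-ins, not the repo's actual APIs; the schedule constants mirror the config keys listed in step 3.

```python
import torch

@torch.no_grad()
def generate(prompt, text_encoder, unet, vae_decoder,
             ddim_steps=50, guidance_scale=7.5, num_train_timesteps=1000,
             linear_start=0.00085, linear_end=0.012, device="cuda"):
    # Precompute the noise schedule (betas linear in sqrt-space, per the configs)
    betas = torch.linspace(linear_start ** 0.5, linear_end ** 0.5,
                           num_train_timesteps, device=device) ** 2
    alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

    # Step 2: encode the prompt, plus an empty prompt for classifier-free guidance
    cond, uncond = text_encoder([prompt]), text_encoder([""])

    # Step 6: start from pure noise and denoise over a strided subset of timesteps
    steps = list(range(0, num_train_timesteps, num_train_timesteps // ddim_steps))
    latents = torch.randn(1, 4, 64, 64, device=device)
    for i in reversed(range(len(steps))):
        t = torch.full((1,), steps[i], device=device, dtype=torch.long)
        eps_cond = unet(latents, t, context=cond)
        eps_uncond = unet(latents, t, context=uncond)
        # Classifier-free guidance: extrapolate toward the text-conditioned prediction
        eps = eps_uncond + guidance_scale * (eps_cond - eps_uncond)

        a_t = alphas_cumprod[steps[i]]
        a_prev = alphas_cumprod[steps[i - 1]] if i > 0 else alphas_cumprod.new_tensor(1.0)
        pred_x0 = (latents - (1 - a_t).sqrt() * eps) / a_t.sqrt()
        # Deterministic DDIM update (eta = 0): re-noise pred_x0 to the previous level
        latents = a_prev.sqrt() * pred_x0 + (1 - a_prev).sqrt() * eps

    # Step 7: decode 64x64x4 latents back to a 512x512x3 image tensor
    return vae_decoder(latents)
```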

Data Models

The data structures that flow between stages — the contracts that hold the system together.

LatentTensor ldm/models/autoencoder.py
torch.Tensor[batch_size, 4, height//8, width//8] — 4-channel latent representation with 8x spatial downsampling from input images
Created by VAE encoder from pixel images, transformed through diffusion process, then decoded back to pixels
TextEmbedding ldm/modules/diffusionmodules/openaimodel.py
torch.Tensor[batch_size, max_sequence_length, 768] — CLIP text embeddings with 768-dimensional features per token
Generated from text prompts via frozen CLIP encoder, used as conditioning signal in UNet cross-attention layers
DiffusionBatch ldm/models/diffusion/ddpm.py
dict with 'image': LatentTensor, 'caption': TextEmbedding, 'timesteps': torch.LongTensor[batch_size]
Assembled from latent codes and text embeddings, augmented with random timesteps for training the denoising objective
NoiseSchedule ldm/modules/diffusionmodules/util.py
dict with betas: torch.Tensor[1000], alphas_cumprod: torch.Tensor[1000] — noise variance schedule over diffusion timesteps
Computed once during model initialization, defines how noise is added during training and removed during sampling
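
As a sketch of how a NoiseSchedule is typically built and used: the sqrt-linear schedule and the default linear_start/linear_end values below follow the config keys cited in step 3, and should be read as illustrative rather than the repo's exact code.

```python
import torch

def make_noise_schedule(timesteps=1000, linear_start=0.00085, linear_end=0.012):
    # Betas increase linearly in sqrt-space; alphas_cumprod is the running
    # product of (1 - beta): the fraction of signal surviving to each timestep.
    betas = torch.linspace(linear_start ** 0.5, linear_end ** 0.5, timesteps) ** 2
    alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)
    return {"betas": betas, "alphas_cumprod": alphas_cumprod}

# Forward diffusion (step 3) in closed form: q(x_t | x_0)
schedule = make_noise_schedule()
x0 = torch.randn(2, 4, 64, 64)                 # stand-in for encoded latents
t = torch.randint(0, 1000, (2,))               # one random timestep per sample
a = schedule["alphas_cumprod"][t].view(-1, 1, 1, 1)
noise = torch.randn_like(x0)
x_t = a.sqrt() * x0 + (1 - a).sqrt() * noise   # noisy latents fed to the UNet
```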

Hidden Assumptions

Things this code relies on but never validates: the usual sources of silent failures when the system changes.

critical Shape weakly guarded

Assumes all tensors from model.alphas_cumprod, model.betas have exactly self.ddpm_num_timesteps elements (shape[0])

If this fails: If model was trained with different timestep count than expected (e.g., 500 vs 1000), register_buffer silently creates misaligned schedules causing sampling to use wrong noise levels and generate corrupted images

ldm/models/diffusion/ddim.py:register_buffer
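
A hypothetical length check that would turn the silent misalignment into an immediate error (names follow the description above, not the repo's exact code):

```python
import torch

def check_schedule_length(tensor: torch.Tensor, name: str, ddpm_num_timesteps: int):
    # Fail loudly when a schedule buffer doesn't cover every DDPM timestep,
    # rather than silently registering misaligned noise levels.
    if tensor.shape[0] != ddpm_num_timesteps:
        raise ValueError(f"{name}: expected {ddpm_num_timesteps} entries, "
                         f"got {tensor.shape[0]}")
```
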
critical Domain unguarded

Assumes input images are RGB with values in the normalized range the encoder expects (the repo's preprocessing scripts map pixels to [-1, 1]) and spatial dimensions divisible by the encoder downsampling factor (typically 8x)

If this fails: If the dataloader provides images in the wrong range (e.g. raw [0, 255] or un-rescaled [0, 1]) or with odd dimensions like 513x513, the encoder produces wrong latent codes without error; the VAE encodes the out-of-range values incorrectly, leading to generation artifacts

ldm/models/autoencoder.py:VQModel
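
A hypothetical pre-flight check before encoding; the [-1, 1] range here reflects the repo's preprocessing scripts, so adjust it to whatever range the checkpoint was trained on:

```python
import torch

def check_encoder_input(x: torch.Tensor, lo=-1.0, hi=1.0, factor=8):
    # Validate range and divisibility before handing images to the VAE encoder.
    b, c, h, w = x.shape
    if c != 3:
        raise ValueError(f"expected 3 RGB channels, got {c}")
    if h % factor or w % factor:
        raise ValueError(f"spatial dims {h}x{w} not divisible by {factor}")
    if x.min() < lo or x.max() > hi:
        raise ValueError(f"values span [{x.min():.2f}, {x.max():.2f}], "
                         f"expected [{lo}, {hi}]")
```
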
critical Device unguarded

Assumes device parameter matches the actual device of tensors that will consume the generated random values

If this fails: If called with device='cpu' but tensors are on GPU, subsequent operations trigger expensive CPU->GPU transfers during every sampling step, causing 10x+ slowdown without clear error messages

ldm/models/diffusion/ddpm.py:uniform_on_device
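
The helper itself is small; the sketch below matches its described shape (allocate directly on the consuming device), and the call pattern shows how deriving `device` from the consuming tensor keeps producer and consumer aligned:

```python
import torch

def uniform_on_device(r1, r2, shape, device):
    # Draw uniforms in [r2, r1) directly on the target device, avoiding
    # per-step CPU->GPU copies when the consumer lives on the GPU.
    return (r1 - r2) * torch.rand(*shape, device=device) + r2

# Safer call pattern: derive the device from the tensor that will consume it.
latents = torch.zeros(1, 4, 64, 64)  # stand-in; could live on any device
noise_level = uniform_on_device(1.0, 0.0, (1,), device=latents.device)
```
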
critical Contract unguarded

Assumes cross-attention context tensors have exactly 768 dimensions (CLIP ViT-L/14 embedding size) and sequence length matches expected max_seq_len

If this fails: If using different text encoder (e.g., CLIP ViT-B/32 with 512 dims), attention weights are initialized for wrong dimensions causing matrix multiplication shape errors or silent attention computation on padded/truncated features

ldm/modules/attention.py:SpatialTransformer
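
A hypothetical shape guard at the transformer boundary; the 768 default matches CLIP ViT-L/14 as described above:

```python
import torch

def check_context(context: torch.Tensor, expected_dim: int = 768):
    # Cross-attention projections are sized for a fixed embedding width;
    # a 512-dim ViT-B/32 context cannot be consumed by 768-dim weights.
    if context.shape[-1] != expected_dim:
        raise ValueError(f"context dim {context.shape[-1]} != {expected_dim}; "
                         "swap in matching attention weights or re-project")
```
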
warning Scale unguarded

Assumes ddim_num_steps is much smaller than ddpm_num_timesteps (e.g., 50 vs 1000) for meaningful acceleration

If this fails: If user sets ddim_num_steps=999 when ddpm_num_timesteps=1000, timestep spacing becomes [0,1,2,...,998] instead of [0,20,40,...,980], defeating DDIM's purpose and causing near-identical quality but same computational cost as full DDPM

ldm/models/diffusion/ddim.py:make_schedule
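
The spacing arithmetic is easy to see directly. This sketch follows the uniform discretization in the spirit of the repo's make_ddim_timesteps helper (offset details omitted):

```python
import numpy as np

def ddim_timestep_grid(ddim_num_steps: int, ddpm_num_timesteps: int = 1000):
    # Uniform stride over the training timesteps: integer division sets the
    # spacing, so ddim_num_steps close to ddpm_num_timesteps collapses it to 1.
    c = ddpm_num_timesteps // ddim_num_steps
    return np.asarray(list(range(0, ddpm_num_timesteps, c)))

print(ddim_timestep_grid(50)[:5])   # [ 0 20 40 60 80] -- 20x fewer UNet calls
print(ddim_timestep_grid(999)[:5])  # [0 1 2 3 4] -- stride collapses to 1
```
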
warning Resource unguarded

Assumes sufficient GPU memory for batch_size * image_resolution^2 * 4_channels * precision_bytes, typically requiring 24GB+ for batch_size=4 at 512x512

If this fails: With the default config on smaller GPUs (8GB), training crashes with CUDA OOM after several steps once gradients accumulate, and the error message doesn't indicate which config parameters to reduce

main.py:get_parser
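
The stated formula in arithmetic form. Note it is only a lower bound: UNet activations, gradients, and optimizer state dominate actual usage, which is where the 24GB figure comes from.

```python
def activation_lower_bound_bytes(batch_size=4, resolution=512,
                                 channels=4, precision_bytes=4):
    # batch_size * image_resolution^2 * 4_channels * precision_bytes
    return batch_size * resolution ** 2 * channels * precision_bytes

mib = activation_lower_bound_bytes() / 2 ** 20
print(f"{mib:.0f} MiB lower bound")  # 16 MiB; real training needs far more
```
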
warning Temporal unguarded

Assumes EMA weights are updated after every training step in consistent order without skipped steps

If this fails: If training is resumed from a checkpoint but the EMA decay rate changed, or if validation steps modify model parameters, the EMA weights become desynchronized, leading to worse inference quality that validation metrics don't detect

ldm/modules/ema.py:LitEma
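
To clarify what those EMA weights are, a minimal sketch in the spirit of LitEma (not the repo's exact implementation):

```python
import torch

class Ema:
    """Exponential moving average over a parameter list."""

    def __init__(self, params, decay=0.9999):
        self.decay = decay
        self.shadow = [p.detach().clone() for p in params]

    @torch.no_grad()
    def update(self, params):
        # Must run once per optimizer step, every step, in the same order;
        # skipped or reordered updates are exactly the desync described above.
        for s, p in zip(self.shadow, params):
            s.mul_(self.decay).add_(p, alpha=1.0 - self.decay)
```
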
warning Environment unguarded

Assumes CUDA is available and model.device points to a valid GPU device

If this fails: Hard-coded .to(torch.device('cuda')) call fails on CPU-only systems or when CUDA_VISIBLE_DEVICES hides GPUs, causing immediate crash during sampler initialization with unclear error about device mismatch

ldm/models/diffusion/ddim.py:register_buffer
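
A portable pattern that avoids the hard-coded device (illustrative; the buffer name is a stand-in):

```python
import torch

# Fall back to CPU when CUDA is absent or hidden by CUDA_VISIBLE_DEVICES.
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")

schedule = torch.linspace(0.0, 1.0, 1000)  # stand-in for a schedule buffer
schedule = schedule.to(device)
```
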
warning Contract unguarded

Assumes the config dict contains a 'target' key pointing to an importable class path and 'params' containing valid constructor arguments

If this fails: A typo in the config YAML, e.g. target: ldm.models.diffusion.ddpm.LatentDiffusio (missing the final 'n'), causes an AttributeError during model creation, but the error doesn't indicate which config file or which component failed to instantiate

ldm/util.py:instantiate_from_config
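
The helper itself is small; this sketch mirrors its documented behavior (resolve the dotted path in 'target', construct with 'params'):

```python
import importlib

def get_obj_from_str(string: str):
    # Resolve "pkg.module.ClassName" to the class object.
    module, cls = string.rsplit(".", 1)
    return getattr(importlib.import_module(module), cls)

def instantiate_from_config(config: dict):
    # Look up the class named by `target` and construct it with `params`.
    # A typo in `target` surfaces here as an ImportError or AttributeError.
    if "target" not in config:
        raise KeyError("Expected key `target` to instantiate.")
    return get_obj_from_str(config["target"])(**config.get("params", dict()))
```
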
warning Domain weakly guarded

Assumes timestep embeddings are integers in range [0, num_timesteps-1] and spatial input dimensions match expected attention_resolutions

If this fails: If timesteps contain negative values or exceed training range, or if latent spatial size doesn't divide evenly by attention_resolutions, the model processes invalid timestep embeddings or misaligned spatial attention leading to generation artifacts

ldm/modules/diffusionmodules/openaimodel.py:UNetModel
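
A hypothetical validation of both assumptions before the forward pass:

```python
import torch

def check_unet_inputs(x: torch.Tensor, t: torch.Tensor,
                      num_timesteps: int = 1000, max_downsample: int = 8):
    # Timesteps must be integers inside the training range...
    if t.dtype not in (torch.int32, torch.int64):
        raise TypeError(f"timesteps must be integer, got {t.dtype}")
    if (t < 0).any() or (t >= num_timesteps).any():
        raise ValueError(f"timesteps outside [0, {num_timesteps - 1}]")
    # ...and latent spatial dims must survive repeated halving so the
    # attention_resolutions levels line up.
    h, w = x.shape[-2:]
    if h % max_downsample or w % max_downsample:
        raise ValueError(f"latent dims {h}x{w} not divisible by {max_downsample}")
```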

System Behavior

How the system operates at runtime — where data accumulates, what loops, what waits, and what controls what.

Data Pools

Model Checkpoints (checkpoint)
PyTorch Lightning saves model weights, optimizer state, and training metadata — enables resuming training and loading pretrained weights for inference
EMA Weights (state-store)
Exponential moving average of model parameters maintained during training — provides more stable inference weights by averaging recent parameter updates
Noise Schedule Cache (buffer)
Precomputed beta, alpha, and cumulative alpha values for all 1000 timesteps — avoids recomputation during training and sampling


Technology Stack

PyTorch (framework)
Core tensor operations, automatic differentiation, and neural network building blocks
PyTorch Lightning (framework)
Training orchestration with multi-GPU support, checkpointing, and logging infrastructure
CLIP (library)
Frozen text encoder that converts prompts to 768-dimensional embeddings for conditioning
OmegaConf (library)
Hierarchical configuration management for model architectures and training hyperparameters
einops (library)
Tensor reshaping operations for attention mechanisms and spatial transformations
transformers (library)
Hugging Face library providing CLIP model implementations and tokenization


Frequently Asked Questions

What is stable-diffusion used for?

Generates images from text prompts using latent diffusion in a compressed autoencoder space. compvis/stable-diffusion is a 6-component ML inference system written in Jupyter Notebook. Data flows through 7 distinct pipeline stages. The codebase contains 43 files.

How is stable-diffusion architected?

stable-diffusion is organized into 4 architecture layers: Autoencoder Layer, Diffusion Layer, Attention Layer, Training Infrastructure. Data flows through 7 distinct pipeline stages. This layered structure keeps concerns separated and modules independent.

How does data flow through stable-diffusion?

Data moves through 7 stages: Encode images to latent space → Encode text prompts → Add scheduled noise → Predict noise with UNet → Compute training loss → .... Images and text enter the system during training where images are encoded to latent space, noise is added according to a schedule, and a UNet learns to predict this noise conditioned on CLIP text embeddings. During inference, the process reverses: text is encoded to embeddings, random noise is iteratively denoised by the trained UNet guided by these embeddings, and the resulting clean latents are decoded back to pixel images. This pipeline design reflects a complex multi-stage processing system.

What technologies does stable-diffusion use?

The core stack includes PyTorch (Core tensor operations, automatic differentiation, and neural network building blocks), PyTorch Lightning (Training orchestration with multi-GPU support, checkpointing, and logging infrastructure), CLIP (Frozen text encoder that converts prompts to 768-dimensional embeddings for conditioning), OmegaConf (Hierarchical configuration management for model architectures and training hyperparameters), einops (Tensor reshaping operations for attention mechanisms and spatial transformations), transformers (Hugging Face library providing CLIP model implementations and tokenization). A focused set of dependencies that keeps the build manageable.

What system dynamics does stable-diffusion have?

stable-diffusion exhibits 3 data pools (Model Checkpoints, EMA Weights, Noise Schedule Cache), 3 feedback loops, 5 control points, and 3 delays. The feedback loops handle training-loop and recursive dynamics. These runtime behaviors shape how the system responds to load, failures, and configuration changes.

What design patterns does stable-diffusion use?

4 design patterns detected: Latent Space Operation, Cross-Attention Conditioning, Classifier-Free Guidance, Configuration-Driven Architecture.

Analyzed on April 20, 2026 by CodeSea.