What is a diffusion model in simple terms?

A diffusion model learns to reverse a noising process. During training, you take real images and gradually add random noise until the image becomes pure static. The model learns to predict and remove that noise step by step. At generation time, you start from pure random noise and run the learned denoising process, conditioned on a text prompt, to produce a new image that matches what you asked for.

Why are diffusion models better than GANs?

GANs (Generative Adversarial Networks) suffer from mode collapse, where the generator learns to produce a narrow range of outputs instead of the full diversity of the training data, and from training instability caused by the adversarial game between generator and discriminator. Diffusion models train with a stable objective (predict the noise at each step), scale well with more compute and data, and produce more diverse, controllable outputs. The main trade-off is inference speed: diffusion requires many denoising steps while a GAN is a single forward pass.

What does a text encoder do in image generation?

A text encoder converts your prompt into a sequence of numerical vectors (embeddings) that capture the semantic meaning of each word and phrase. These vectors are then injected into the denoising network via cross-attention at every denoising step, so the model continuously reads your prompt while generating. CLIP encoders are trained for image-text alignment; T5 encoders handle longer, more complex prompts; some models use both in parallel.

What is classifier-free guidance (CFG) and what does the scale do?

CFG is a training and sampling trick that makes the model follow your prompt more strongly. During training, conditioning is randomly dropped out so the model learns both conditional and unconditional generation. At inference, two noise predictions are made (one with your prompt, one without) and extrapolated using: final = unconditional + scale × (conditional - unconditional). A scale of 7 is a common default. Lower values (3-5) give more varied, naturalistic results; higher values (10-15) create strong prompt adherence but can produce oversaturated, artifact-prone images.

What is latent diffusion and why does it use a VAE?

Denoising at full pixel resolution is computationally prohibitive—a 1024x1024 image has over 3 million values. Latent diffusion uses a variational autoencoder (VAE) to compress the image into a smaller latent representation (typically 8x smaller per spatial dimension), runs the diffusion process in that compressed space, then decodes back to pixels. This is why Stable Diffusion runs on consumer GPUs while earlier pixel-space models required much more compute.

What is flow matching and how does it differ from DDPM?

DDPM uses a stochastic noising process with complex noise schedules requiring many steps. Flow matching is a simpler framework where the model learns to transport data along straight-line paths between the noise distribution and the image distribution. Straight paths mean less backtracking during inference, enabling high-quality results in as few as 20 steps. Stable Diffusion 3 and Flux use flow matching; they tend to need fewer inference steps than DDPM models for comparable quality.

What is a LoRA and when should I use one?

LoRA (Low-Rank Adaptation) is a fine-tuning technique that trains a small set of additional weight matrices added to the frozen pretrained model—typically 10-100MB of weights vs. gigabytes for full fine-tuning. Use a LoRA when you want to teach the model a specific style, subject, or character without training from scratch or using enormous compute. You apply a LoRA at inference by loading it alongside the base model with a strength multiplier (0 = ignore, 1 = full strength).
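
The low-rank idea can be sketched in a few lines. This is a minimal illustration, not how any particular library implements it: the helper names (`matmul`, `apply_lora`, `lora_param_count`) and the tiny matrices are assumptions, and real implementations usually add an extra alpha/rank scaling factor that is omitted here.

```python
def matmul(a, b):
    """Plain nested-list matrix product (a is n x r, b is r x m)."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def apply_lora(W, A, B, strength=1.0):
    """Return W + strength * (B @ A); W itself stays frozen and untouched."""
    delta = matmul(B, A)
    return [[w + strength * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]

def lora_param_count(d_out, d_in, rank):
    """Trainable values in B (d_out x rank) plus A (rank x d_in)."""
    return rank * (d_out + d_in)

# Toy 3x3 frozen weight with a rank-1 update.
W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
B = [[1.0], [0.0], [2.0]]      # 3 x 1
A = [[0.5, 0.0, -1.0]]         # 1 x 3

ignored = apply_lora(W, A, B, strength=0.0)   # identical to W
full = apply_lora(W, A, B, strength=1.0)      # W plus the full low-rank delta

# For one 4096 x 4096 layer, rank 16 trains far fewer values than the layer holds.
full_layer = 4096 * 4096
lora_layer = lora_param_count(4096, 4096, 16)
```

The size gap between `full_layer` and `lora_layer` is why LoRA files stay small relative to a full checkpoint.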

How many sampling steps do I actually need?

It depends on the sampler and architecture. With original DDPM, 1000 steps were needed. Modern samplers like DPM++ 2M Karras produce good results in 25-30 steps for SD-era models. With flow matching models (Flux, SD3), 20 steps is typically sufficient for production-quality results. Fewer than 15 steps tends to produce noticeably degraded results; more than 50 rarely helps with any current model. Start at 20 and increase if you see artifacts.

What are negative prompts and do they always work?

Negative prompts use CFG mechanics to push generation away from concepts. Instead of empty conditioning for the unconditional prediction, the model uses your negative prompt as the unconditional signal—so the CFG formula steers away from those concepts. Common uses include suppressing artifacts, watermarks, and quality degradation. They work well with DDPM-based models (SD 1.x through SDXL) but have inconsistent behavior with flow matching models (Flux, SD3) because the training objective differs.

How Diffusion Models Work: Complete Guide to AI Image Generation (2026) | explainx.ai Blog

Every time you type a prompt into DALL-E 3, Midjourney, or Stable Diffusion and an image appears, the same broad process is running: start from random noise, denoise step by step, guided by your text. The specific implementation differs across systems—different architectures, text encoders, training objectives—but the underlying framework of diffusion is shared by virtually all state-of-the-art image generation models in 2026.

This guide is the complete explainer. After reading it, you will understand not just what diffusion models are but how each component works: the noise schedules, the denoising network, how text enters via cross-attention, why models use a VAE to work in compressed latent space, what classifier-free guidance is actually doing to your images, how U-Net architectures gave way to transformers and then flow matching, and what the practical knobs (steps, CFG scale, seed, negative prompts) actually control. You will also understand LoRA at the conceptual level and see working Python code for two major model families.

Why diffusion? The intuition before the math

Before the equations, it helps to build intuition for why this approach works at all.

The ink-in-water analogy

Imagine placing a drop of ink into a glass of water. Over time, the ink diffuses outward, spreading through the water until you cannot tell where it started and the water is uniformly colored. This is the forward process: starting from an organized state (the ink drop) and moving toward maximum disorder (uniform distribution).

Now imagine running that process in reverse: given uniformly colored water, could you recover the original ink drop? Of course not—information has been destroyed. But what if you knew exactly how ink diffuses in water—the precise physics at every moment? Then you could, in principle, work backward step by tiny step, inferring where the ink probably was at each prior moment.

Diffusion models do the same thing with images. The forward process turns a real image into pure random noise over many small steps. The reverse process—learned by the neural network—removes noise step by step. The model doesn't recover the original image (that information is gone). Instead, it learns the statistical structure of what images look like, so it can generate plausible images starting from the noise. With text conditioning, it generates plausible images that match your description.

Why not just use GANs?

Generative Adversarial Networks dominated image generation before diffusion. A GAN trains two networks—a generator that creates images and a discriminator that distinguishes fakes from real—in adversarial competition. This works but has fundamental problems:

Mode collapse: the generator learns to fool the discriminator with a limited set of outputs rather than capturing the full diversity of real images. A GAN trained on human faces might only generate certain face types, ignoring most of the space.

Training instability: the adversarial game is delicate. Both networks must improve at roughly the same rate. If one dominates, training diverges. Reproducing GAN results across different hardware and seeds was notoriously difficult.

Difficulty scaling: producing consistently high-quality, high-resolution, diverse outputs required increasingly elaborate architectural tricks (progressive growing, attention mechanisms, style injection).

Diffusion models have a stable, well-defined training objective—predict the noise at each step—and scale predictably with more compute and data. The main trade-off is inference speed: diffusion requires many sequential denoising steps, while GAN inference is a single forward pass. Modern samplers and distillation techniques have narrowed this gap substantially.

The forward process: adding noise during training

The forward process is not what happens when you generate an image—it is what happens during training. Understanding it is essential for understanding what the model is actually learning.

What happens mathematically

Starting from a real training image x₀, you add Gaussian noise in T small steps to produce progressively noisier versions x₁, x₂, ..., x_T. By step T (typically 1000 in original DDPM), the image is essentially pure random noise—statistically indistinguishable from sampling directly from a standard Gaussian distribution.

The math is structured so you can compute x_t (the noisy version at any timestep t) directly from x₀ in a single calculation, without running through all intermediate steps. Concretely:

x_t = √(ᾱ_t) · x₀ + √(1 - ᾱ_t) · ε

where ε is standard Gaussian noise and ᾱ_t is the cumulative noise schedule value at timestep t. This closed-form means you can sample a random timestep t during training, noise the image to x_t in one shot, and use that as a training example.
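
As a rough sketch of that one-shot noising, the code below builds ᾱ_t from a linear schedule and jumps straight from x₀ to x_t. The schedule endpoints (1e-4 and 0.02), the function names, and the four-value stand-in "image" are assumptions chosen for illustration.

```python
import math
import random

def linear_betas(T=1000, beta_start=1e-4, beta_end=0.02):
    """Per-step noise variances rising linearly (endpoint values are assumed)."""
    return [beta_start + (beta_end - beta_start) * i / (T - 1) for i in range(T)]

def cumulative_alphas(betas):
    """alpha_bars[t - 1] is the product of (1 - beta) over steps 1..t."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars

def noise_to_timestep(x0, t, alpha_bars, rng):
    """Compute x_t directly from x0 in one calculation; returns (x_t, eps)."""
    a_bar = alpha_bars[t - 1]
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    x_t = [math.sqrt(a_bar) * x + math.sqrt(1.0 - a_bar) * e
           for x, e in zip(x0, eps)]
    return x_t, eps

alpha_bars = cumulative_alphas(linear_betas())
rng = random.Random(0)
x0 = [0.5, -0.25, 1.0, 0.0]     # stand-in for pixel values scaled to [-1, 1]
x_early, _ = noise_to_timestep(x0, 10, alpha_bars, rng)     # still close to x0
x_late, _ = noise_to_timestep(x0, 1000, alpha_bars, rng)    # almost pure noise
```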

What the model learns: given x_t (the noisy image), t (which timestep), and a conditioning signal c (the text embedding), predict the noise ε that was added. The training loss is mean squared error between predicted and actual noise:

L = E[||ε - ε_θ(x_t, t, c)||²]

The model is learning: "given this noisy image and this amount of noise, what noise was added?" Running this in reverse lets you go from noise to image.
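
To make the loss concrete, here is a small sketch of a single training example. The `untrained_model` placeholder stands in for ε_θ and is purely hypothetical, as are the ᾱ_t value of 0.5 and the timestep label of 500.

```python
import math
import random

def mse(pred, target):
    """Mean squared error between predicted and actual noise."""
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(target)

def make_training_example(x0, alpha_bar_t, rng):
    """Noise x0 to x_t in one shot and keep the noise as the regression target."""
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    x_t = [math.sqrt(alpha_bar_t) * x + math.sqrt(1.0 - alpha_bar_t) * e
           for x, e in zip(x0, eps)]
    return x_t, eps

def untrained_model(x_t, t, c):
    """Hypothetical placeholder for eps_theta: always predicts zero noise."""
    return [0.0] * len(x_t)

rng = random.Random(0)
x0 = [0.5, -0.25, 1.0, 0.0]
alpha_bar_t = 0.5                      # assumed schedule value at the sampled t
x_t, eps = make_training_example(x0, alpha_bar_t, rng)

loss_untrained = mse(untrained_model(x_t, t=500, c=None), eps)
# A perfect predictor would return eps itself, driving the loss to zero.
loss_perfect = mse(eps, eps)
```

Training amounts to nudging the network's weights so its loss moves from something like `loss_untrained` toward `loss_perfect`, averaged over many images and timesteps.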

Noise schedules: how much noise at each step

The noise schedule determines how quickly noise accumulates across the T timesteps. Different schedules have significantly different properties:

Linear schedule (original DDPM): noise variance increases linearly from a small value at t=1 to a large value at t=T. Simple but suboptimal—the model spends too many steps at high noise levels where the signal is almost gone and the model can't learn much.

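Cosine schedule: noise follows a cosine curve, spending more steps at intermediate noise levels where the most image structure is encoded. It was proposed as a fix for the linear schedule's wasted high-noise steps and gives a better distribution of learning signal across timesteps. Note that Stable Diffusion 1.x through SDXL are usually configured with a scaled-linear beta schedule rather than the cosine one.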

Flow matching (ODE-based): Stable Diffusion 3 and Flux abandon the stochastic diffusion framework entirely and train the model to transport data along straight (or near-straight) paths from the noise distribution to the image distribution. Conceptually simpler, trains more efficiently, and supports fewer inference steps because the paths are shorter and straighter.
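
The difference between the first two schedules is easy to see numerically. The sketch below compares how much signal (ᾱ_t) survives at a few timesteps. The cosine form with offset s = 0.008 and the linear endpoints are the commonly cited values, treated here as assumptions.

```python
import math

def linear_alpha_bar(T=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative signal level under a linear beta schedule."""
    out, running = [], 1.0
    for i in range(T):
        beta = beta_start + (beta_end - beta_start) * i / (T - 1)
        running *= 1.0 - beta
        out.append(running)
    return out

def cosine_alpha_bar(T=1000, s=0.008):
    """Cumulative signal level defined directly by a squared-cosine curve."""
    def f(t):
        return math.cos((t / T + s) / (1 + s) * math.pi / 2) ** 2
    return [f(t) / f(0) for t in range(1, T + 1)]

linear = linear_alpha_bar()
cosine = cosine_alpha_bar()

for t in (250, 500, 750):
    # Index t - 1 holds the value for timestep t.
    print(t, round(linear[t - 1], 3), round(cosine[t - 1], 3))
```

Halfway through, the linear schedule has already destroyed most of the signal while the cosine schedule retains roughly half of it, which is the "too many steps at high noise" problem in numbers.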

The reverse process: generating images at inference

When you generate an image, the forward process doesn't happen. You start from pure Gaussian noise x_T and run the reverse process to produce x₀—your generated image.

The denoising loop

The basic generation loop:

1. Sample x_T from a standard Gaussian distribution (pure random noise)
2. For t = T, T−1, ..., 1:
   a. Predict noise: ε̂ = ε_θ(x_t, t, text_embedding)
   b. Estimate the clean image x₀ from x_t and ε̂
   c. Compute x_{t−1} by taking a step toward the estimated x₀ (with appropriate noise depending on the sampler)
3. Return x₀ as the generated image

Step 2c is where different samplers diverge. The mathematical form of "take a step toward x₀" varies, and this is what drives the practical differences between sampling algorithms.
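
The loop can be sketched end to end on a toy problem. This version uses a deterministic DDIM-style update for step 2c, which is only one of several possible choices, and it replaces the trained network with a hypothetical `oracle_noise` function that already knows the target, so it shows the mechanics of the loop rather than real generation.

```python
import math
import random

TARGET = [0.5, -0.25, 1.0, 0.0]   # the "image" the toy oracle steers toward

def alpha_bar_schedule(T, beta_start=1e-4, beta_end=0.02):
    """alpha_bar[0] = 1 (clean image); alpha_bar[t] shrinks toward 0 as t grows."""
    alpha_bar = [1.0]
    for i in range(T):
        beta = beta_start + (beta_end - beta_start) * i / max(T - 1, 1)
        alpha_bar.append(alpha_bar[-1] * (1.0 - beta))
    return alpha_bar

def oracle_noise(x_t, t, alpha_bar):
    """Hypothetical stand-in for eps_theta: the exact noise relative to TARGET."""
    a_t = alpha_bar[t]
    return [(xi - math.sqrt(a_t) * x0) / math.sqrt(1.0 - a_t)
            for xi, x0 in zip(x_t, TARGET)]

def generate(predict_noise, dim, T, seed):
    alpha_bar = alpha_bar_schedule(T)
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]           # step 1: sample x_T
    for t in range(T, 0, -1):                               # step 2: t = T..1
        eps_hat = predict_noise(x, t, alpha_bar)            # 2a: predict noise
        a_t, a_prev = alpha_bar[t], alpha_bar[t - 1]
        x0_hat = [(xi - math.sqrt(1.0 - a_t) * e) / math.sqrt(a_t)
                  for xi, e in zip(x, eps_hat)]             # 2b: estimate x0
        x = [math.sqrt(a_prev) * x0 + math.sqrt(1.0 - a_prev) * e
             for x0, e in zip(x0_hat, eps_hat)]             # 2c: DDIM-style step
    return x                                                # step 3: return x0

result = generate(oracle_noise, dim=4, T=1000, seed=0)
```

Swapping the body of step 2c for a different update rule is, in essence, what changing the sampler does.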

Samplers: the algorithms that govern each step

The sampler (or scheduler) defines the specific update rule for going from x_t to x_{t−1}. Understanding the main options helps you choose the right one for your use case.

DDIM (Denoising Diffusion Implicit Models): a deterministic sampler that enables larger steps. Instead of the full stochastic reverse process, DDIM takes a direct path toward the estimated clean image. Result: 20-50 DDIM steps produce quality comparable to 1000 DDPM steps. Because it's deterministic, the same seed and step count always produce the same image, and changing the step count usually refines detail rather than changing the overall composition.

DPM++ (2M, SDE, 2M Karras, etc.): a family of higher-order ODE solvers that achieve better quality per step than DDIM. DPM++ 2M Karras is arguably the most reliable general-purpose sampler for DDPM-based models through mid-2026—good quality in 20-30 steps, consistent behavior. The "Karras" variant refers to a specific noise schedule adjustment.

Euler / Euler A: Euler is a simple first-order ODE solver—fast but requiring more steps for equivalent quality. Euler A (Ancestral) adds stochasticity at each step: each step injects a small amount of noise back, producing more variation across seeds. Good for creative exploration; less predictable for fine-tuning specific seeds.

LCM / Lightning / Turbo: distilled models that learn to produce good images in 4-8 steps by training the model specifically to skip timesteps. Quality is noticeably lower than full-step models but speed improvement is dramatic—useful for real-time previews.
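
The "Karras" adjustment mentioned above can be sketched as a spacing rule for noise levels (sigmas). The formula below is the commonly used ρ-spaced schedule. The specific `sigma_min`, `sigma_max`, and `rho` values are assumptions in the range typically quoted for SD-era models, not settings taken from any particular tool.

```python
def karras_sigmas(n_steps, sigma_min=0.03, sigma_max=14.6, rho=7.0):
    """Noise levels from sigma_max down to sigma_min, spaced in sigma**(1/rho)."""
    if n_steps < 2:
        raise ValueError("need at least two steps to space a schedule")
    max_inv = sigma_max ** (1.0 / rho)
    min_inv = sigma_min ** (1.0 / rho)
    return [(max_inv + (i / (n_steps - 1)) * (min_inv - max_inv)) ** rho
            for i in range(n_steps)]

sigmas = karras_sigmas(21)
midpoint = sigmas[10]                      # noise level halfway through sampling
uniform_midpoint = (14.6 + 0.03) / 2       # what evenly spaced sigmas would give
```

Because `midpoint` sits far below `uniform_midpoint`, most of the step budget is spent at low noise levels, where fine detail is resolved.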

Step count guidelines

Steps	Quality	Use case
4-8	Rough but recognizable	Real-time preview, rapid iteration
15-20	Good quality with modern samplers	Most production tasks with DPM++ or flow matching
25-35	High quality	Fine-tuned outputs, high-stakes generation
40-50	Marginal improvement	Specific cases where 30 steps has artifacts
1000 (DDPM)	Historical baseline	Research only

The right number depends on the sampler and architecture. With flow matching models (Flux, SD3), 20 steps is typically excellent. With DDPM-based models and DPM++ 2M Karras, 25-30 is a safe default.

How text conditioning works

Text-to-image means the model generates images corresponding to a text prompt. For this to work, the text must influence every denoising step—not just the initialization.

Text encoders: converting words to vectors

The first step is converting your text prompt into a sequence of vectors the denoising network can use. This is done by a text encoder—a separate pre-trained model that runs before any denoising starts.

CLIP (Contrastive Language-Image Pretraining): trained to align image and text representations in a shared embedding space using contrastive learning. CLIP text encoders produce embeddings that directly capture the visual relationship between language and images. Used in Stable Diffusion 1.x, 2.x, and parts of SDXL. Main limitation: CLIP was trained on relatively short image captions, so prompts longer than 77 tokens are truncated and complex multi-clause descriptions may be partially ignored.

T5 (Text-to-Text Transfer Transformer): a large language model trained on text-only tasks. T5-XXL and similar variants understand long, complex, grammatically rich text descriptions far better than CLIP. Imagen uses T5-based encoders extensively; Stable Diffusion 3 uses both T5-XXL and CLIP. T5-conditioned models respond well to prose-style prompts.

Dual encoders: SDXL uses two CLIP-style encoders (CLIP ViT-L and OpenCLIP ViT-bigG) in parallel, concatenating their outputs. Flux uses both CLIP-L and T5-XXL. The intuition: different encoders capture different aspects of language. Combining them gives richer, more robust conditioning.

LLM-based encoders: some newer research uses the hidden states of full large language models as text conditioning, hypothesizing that a capable LLM's internal representations should capture richer semantic understanding than CLIP or even T5.

Cross-attention: the mechanism that joins text to image

Once you have text embeddings (a sequence of vectors, one per token), they must influence the denoising network at every step. The mechanism is cross-attention.

In a cross-attention layer inside the denoising network:

Image features at each spatial location form the queries: "what am I looking at and what do I need?"
Text token embeddings form the keys and values: "what information is available from the prompt?"
Attention weights determine how much each image location should attend to each text token
The output is image features updated to reflect relevant textual context
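
The four points above map onto a few lines of code. This is a stripped-down sketch: real layers first pass image features and text embeddings through learned projection matrices and use multiple heads, both omitted here, and the tiny vectors are made up for illustration.

```python
import math

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys, values):
    """queries come from image locations; keys and values come from text tokens."""
    d = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        w = softmax(scores)                # how much this location reads each token
        weights.append(w)
        outputs.append([sum(wj * v[c] for wj, v in zip(w, values))
                        for c in range(len(values[0]))])
    return outputs, weights

# Two image locations attending over three text tokens.
queries = [[4.0, 0.0], [0.0, 4.0]]
keys = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
values = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

outputs, weights = cross_attention(queries, keys, values)
```

Each row of `weights` sums to 1, and each row of `outputs` is the corresponding blend of token values written back into that image location.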

This happens at multiple scales within the network. Deep layers (compressed representation) capture high-level semantic content—what kinds of objects, overall composition, scene type. Shallow layers (near full resolution) capture fine-grained style and texture details. Prompt terms affecting composition influence deep layers; terms about texture and material affect shallow layers.

Why prompt length and word order matter

Behavior with long vs. short prompts depends heavily on the text encoder:

CLIP encoders have a 77-token limit (including special tokens). Anything beyond is truncated. The most important terms should appear early in the prompt.
T5-based models handle much longer prompts (256+ tokens) and can reason about complex multi-part descriptions. Writing almost in prose sentences works.
Word order matters for attention: terms earlier in the prompt tend to receive higher attention weights in practice, especially with CLIP.
Emphasis syntax ((term:1.3) in Stable Diffusion UIs) artificially boosts the attention weight contribution for specific tokens—useful when the model underweights a key concept.
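
The 77-token limit can be sketched as a fixed window. The function below is illustrative only: it works on token IDs that have already been produced by a tokenizer, and the start/end token IDs and the choice to pad with the end token are assumptions modeled on the original CLIP tokenizer.

```python
def clip_window(token_ids, bos=49406, eos=49407, max_len=77):
    """Fit prompt tokens into a fixed window; returns (window, dropped_tokens)."""
    room = max_len - 2                       # two slots go to the special tokens
    kept, dropped = token_ids[:room], token_ids[room:]
    window = [bos] + kept + [eos]
    window += [eos] * (max_len - len(window))   # pad short prompts to full length
    return window, dropped

short_window, short_dropped = clip_window([11, 22, 33])
long_window, long_dropped = clip_window(list(range(100)))
```

For the 100-token prompt, everything in `long_dropped` never reaches the model, which is why the most important terms belong near the front.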

Latent diffusion and the VAE

Running the diffusion process directly on pixel values is computationally prohibitive. A 512×512 RGB image contains 786,432 values. Running hundreds of denoising steps over all those values with a large neural network requires enormous compute—far beyond what consumer hardware can provide in reasonable time.

The variational autoencoder (VAE)

The solution is to not work in pixel space at all. Instead, a variational autoencoder compresses images to a smaller latent representation before diffusion runs, and expands them back to pixels after.

The VAE architecture:

Encoder: takes a high-resolution image (e.g., 512×512×3 pixels) and compresses it to a small latent tensor (e.g., 64×64×4 values)
Decoder: takes a latent tensor and expands it back to full-resolution pixels

The 8× spatial downsampling factor (used in SD 1.x through SDXL) is standard: 512×512 pixels become 64×64 latents, 1024×1024 pixels become 128×128 latents.
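
The arithmetic is simple enough to check directly. The helper names below are made up for this sketch; the 8× factor and channel counts come from the figures quoted in this section.

```python
def latent_shape(height, width, downsample=8, channels=4):
    """Shape of the latent tensor for a given pixel resolution."""
    if height % downsample or width % downsample:
        raise ValueError("image sides must be multiples of the downsampling factor")
    return (height // downsample, width // downsample, channels)

def compression_ratio(height, width, downsample=8, channels=4):
    """How many RGB pixel values there are per latent value."""
    lh, lw, lc = latent_shape(height, width, downsample, channels)
    return (height * width * 3) / (lh * lw * lc)

sd15 = latent_shape(512, 512)                        # (64, 64, 4)
sdxl = latent_shape(1024, 1024)                      # (128, 128, 4)
ratio_4ch = compression_ratio(512, 512)              # 786,432 / 16,384
ratio_16ch = compression_ratio(512, 512, channels=16)
```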

The VAE is trained separately from the diffusion model, using:

Reconstruction loss: the decoded image should match the original pixel for pixel
Perceptual loss: the decoded image should be perceptually similar (matching activations in a pre-trained vision network, not just pixel values)
KL divergence term: the latent distribution should be smooth and continuous so the decoder can handle novel latents from the diffusion process

What latent space actually is

Latent space is the lower-dimensional representation the encoder produces. It's not human-interpretable—you cannot look at a 64×64×4 tensor and see what image it represents. But it has important properties:

Compact: representing images with far fewer numbers
Smooth: nearby points in latent space decode to visually similar images; there are no sharp discontinuities
Organized: the encoder learns to cluster related concepts spatially—not in labeled, human-interpretable regions, but in a statistically organized way that the decoder can navigate

The entire diffusion process runs in this latent space. Starting from 64×64×4 random noise, the denoising network iteratively refines the latent over 20-30 steps until it represents a coherent image. Only at the very end does the VAE decoder expand it back to full-resolution pixels.

Why the 8× factor is standard

The 8× spatial compression (with 4 or 16 channels) was found empirically to balance:

Computational efficiency: 64×64 latents have 64× fewer spatial positions than 512×512 pixels
Reconstruction quality: the VAE can still reconstruct fine detail at 8× compression; 16× compression loses too much
Latent expressiveness: enough spatial resolution to represent complex compositional arrangements

Flux and SD3 use 16-channel latents at the same 8× spatial compression, allowing more information per spatial position without changing the compute profile drastically.

Classifier-free guidance (CFG)

CFG is arguably the most impactful practical technique in text-to-image generation. It is what makes models follow prompts strongly rather than producing whatever random plausible images they would generate without guidance.

The training trick

During training, a fraction of examples (typically 10-20%) have their text conditioning randomly dropped out—replaced with an empty embedding or null token. This teaches the model two things simultaneously:

Conditional generation: how to generate images that match a text description
Unconditional generation: how to generate plausible images with no text guidance at all

The dropout doesn't hurt—it actually regularizes the model and ensures it can generate coherently both with and without conditioning.

The sampling trick

At inference, for every denoising step, the model makes two noise predictions:

Conditional: the noise prediction given your actual text prompt
Unconditional: the noise prediction given null/empty conditioning

These are combined using the CFG formula:

ε_final = ε_unconditional + scale × (ε_conditional − ε_unconditional)

Substituting values:

When scale = 1: ε_final = ε_conditional — pure conditional prediction, no CFG amplification
When scale = 7: ε_final = ε_conditional + 6 × (ε_conditional − ε_unconditional); the result overshoots the conditional prediction by 6× the conditional-unconditional difference
When scale = 0: ε_final = ε_unconditional — completely ignores your prompt

The formula extrapolates beyond the conditional prediction, pulling the generated image strongly in the direction your prompt implies.
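
A quick numerical check of the formula, using made-up two-element "noise predictions" so the arithmetic is easy to follow (`cfg_combine` is a name chosen for this sketch):

```python
def cfg_combine(eps_uncond, eps_cond, scale):
    """eps_final = eps_uncond + scale * (eps_cond - eps_uncond), element-wise."""
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]

eps_uncond = [0.0, 1.0]     # prediction with empty conditioning
eps_cond = [1.0, 3.0]       # prediction with the prompt

at_zero = cfg_combine(eps_uncond, eps_cond, 0)    # equals eps_uncond
at_one = cfg_combine(eps_uncond, eps_cond, 1)     # equals eps_cond
at_seven = cfg_combine(eps_uncond, eps_cond, 7)   # well past eps_cond
```

At scale 7 the result lands far outside the segment between the two predictions, which is the extrapolation that produces both stronger prompt adherence and, at high values, oversaturation.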

What CFG scale does to your images

CFG Scale	Effect	When to use
1-3	Very varied, sometimes ignores prompt	Creative exploration, background generation
4-6	Moderate adherence, naturalistic results	Soft styling, organic-looking images
7-8	Standard — good prompt following	Most generation tasks
9-12	Strong adherence, higher contrast, sharper	When prompt compliance is critical
13+	Oversaturation, artifacts common	Rarely beneficial

The default of 7 is not arbitrary—it is empirically calibrated for good quality-adherence trade-offs across diverse prompts. When your outputs look washed out or the model ignores key terms, try higher CFG. When results look artificial or oversaturated, try lower.

Important note for flow matching models: Flux and SD3 use different CFG defaults (3.5 is common for Flux.1-dev). Do not apply SD-era CFG defaults to flow matching models—the training objective differs, and the optimal range is different.

Architecture evolution: U-Net, DiT, and flow matching

The denoising network's architecture has evolved substantially since the original DDPM paper. Each generation brought meaningful improvements in quality, scalability, and efficiency.

U-Net era (2021-2023): SD 1.x through SDXL

The U-Net is a convolutional neural network with an hourglass shape:

Encoder path (downsampling): series of convolution + downsampling blocks that progressively reduce spatial resolution while increasing channel depth. Captures increasingly abstract, global features.
Bottleneck: the lowest-resolution, highest-channel representation where global image understanding lives.
Decoder path (upsampling): series of convolution + upsampling blocks that restore spatial resolution, recovering fine-grained detail.
Skip connections: outputs from corresponding encoder layers are concatenated with decoder inputs, allowing high-resolution spatial detail to bypass the bottleneck directly.

For text conditioning, cross-attention layers are inserted throughout the U-Net, particularly in the bottleneck and decoder, so the model reads the text prompt at both abstract (composition) and detailed (texture) levels. The current denoising timestep is encoded as a sinusoidal embedding, passed through a small MLP, and added to the features inside each residual block.

Stable Diffusion 1.4/1.5, SD 2.x, and SDXL all use U-Net backbones. SDXL adds a second, smaller U-Net as a refiner that runs on the latent after the main U-Net to add fine detail.

Limitation of U-Nets: convolutions have a limited receptive field. Relating content at opposite corners of an image requires many layers, which limits the model's ability to maintain global consistency across the full image.

DiT era (2022-2024): diffusion transformers

The DiT (Diffusion Transformer) architecture, introduced in research around 2022-2023, replaces the convolutional U-Net with a Vision Transformer-style architecture:

The latent tensor is divided into small patches (e.g., 2×2 patches of the 64×64 latent produce a 32×32 grid of tokens, plus time and text tokens)
Each patch becomes a token in a sequence
Standard transformer blocks (multi-head self-attention + feedforward) process the full sequence
Timestep and conditioning are injected via adaptive layer norm (adaLN-Zero)
Final output is reshaped back into the latent tensor shape
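
The patching and reshaping steps can be sketched with plain nested lists. This covers only the first and last bullets (latent to tokens and back). The transformer blocks, the learned patch embedding, and the time and text tokens are left out, and the function names are chosen for this example.

```python
def patchify(latent, patch=2):
    """Turn an H x W x C latent into a sequence of patch*patch*C-length tokens."""
    height, width = len(latent), len(latent[0])
    tokens = []
    for i in range(0, height, patch):
        for j in range(0, width, patch):
            token = []
            for di in range(patch):
                for dj in range(patch):
                    token.extend(latent[i + di][j + dj])
            tokens.append(token)
    return tokens

def unpatchify(tokens, height, width, channels, patch=2):
    """Inverse of patchify: reshape the token sequence back into H x W x C."""
    latent = [[None] * width for _ in range(height)]
    per_row = width // patch
    for idx, token in enumerate(tokens):
        i, j = (idx // per_row) * patch, (idx % per_row) * patch
        for k in range(patch * patch):
            di, dj = divmod(k, patch)
            latent[i + di][j + dj] = token[k * channels:(k + 1) * channels]
    return latent

height, width, channels = 64, 64, 4
latent = [[[float(i * width + j)] * channels for j in range(width)]
          for i in range(height)]

tokens = patchify(latent)                         # 32 x 32 grid, flattened
restored = unpatchify(tokens, height, width, channels)
```

A 64×64×4 latent becomes 1,024 tokens of 16 values each, and that sequence length is what the self-attention cost depends on.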

Why DiT improves on U-Net:

Global attention: self-attention allows any patch to directly attend to any other patch at any depth, not just nearby convolution neighbors. Global spatial relationships become learnable.
Scaling laws: transformers scale more predictably than CNNs with more compute, larger models, and more data. Scaling up a DiT reliably improves quality.
Architectural simplicity: no encoder-decoder asymmetry, no skip connections required.

The quantitative result (from the original DiT paper): larger DiT models achieve lower FID scores on image generation benchmarks with predictable, smooth scaling curves.

Stable Diffusion 3, SD 3.5, and Flux use transformer-based backbones with some architectural variations specific to each.

Flow matching era (2023-2026): Flux and SD3

Flow matching is a different training objective from DDPM/score matching. Instead of learning to predict noise added by a stochastic Markov chain, the model learns a vector field that transports samples from the noise distribution to the image distribution along optimal paths.

The key insight: straight-line paths are more efficient than curved stochastic walks. If the training objective encourages the model to move samples directly toward the data distribution along straight lines, then at inference time fewer steps are needed because each step covers more ground in the right direction.

The training loss for flow matching:

L = E[||v_θ(x_t, t, c) − (x_1 − x_0)||²]

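where v_θ is the model's predicted velocity, x_0 is clean data, x_1 is pure noise, and x_t = (1 − t)·x_0 + t·x_1 is the straight-line interpolation at time t. With this convention the target x_1 − x_0 points from image to noise, so at inference the sampler integrates the learned field backward, from t = 1 (noise) to t = 0 (image).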
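
Here is a toy sketch of the objective and of sampling, using the convention stated above (x_0 is data, x_1 is noise) and assuming linear interpolation x_t = (1 − t)·x_0 + t·x_1. The `oracle_velocity` function is a hypothetical stand-in for a perfectly trained v_θ, and plain Euler integration is just one possible solver.

```python
import random

def interpolate(x0, x1, t):
    """Point on the straight line between data (t = 0) and noise (t = 1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]

def target_velocity(x0, x1):
    """Regression target x_1 - x_0, constant along the whole path."""
    return [b - a for a, b in zip(x0, x1)]

def fm_loss(v_pred, x0, x1):
    """Mean squared error between predicted and target velocity."""
    target = target_velocity(x0, x1)
    return sum((p - q) ** 2 for p, q in zip(v_pred, target)) / len(target)

def euler_sample(velocity_fn, x1, n_steps):
    """Integrate from t = 1 (noise) back to t = 0 (data) with Euler steps."""
    x, dt = list(x1), 1.0 / n_steps
    for k in range(n_steps):
        t = 1.0 - k * dt
        v = velocity_fn(x, t)
        x = [xi - dt * vi for xi, vi in zip(x, v)]
    return x

rng = random.Random(0)
x0 = [0.5, -0.25, 1.0, 0.0]                      # stand-in for a clean image
x1 = [rng.gauss(0.0, 1.0) for _ in x0]           # pure noise

def oracle_velocity(x, t):
    """Hypothetical perfect model: returns the true straight-line velocity."""
    return target_velocity(x0, x1)

x_half = interpolate(x0, x1, 0.5)
sample = euler_sample(oracle_velocity, x1, n_steps=4)
```

With a perfectly straight path, even four Euler steps land back on x_0. Real models only approximate straight paths, which is why they still need on the order of 20 steps.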

Practical differences from DDPM:

Fewer inference steps: 20 steps is typically sufficient for Flux.1-dev vs. 25-30 for DPM++ on DDPM models
More uniform training signal: DDPM's loss concentrates too much on high-noise timesteps; flow matching distributes learning more evenly
Better with large models: the simpler objective scales more cleanly

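Limitation: negative prompts behave inconsistently across flow matching models. Guidance-distilled checkpoints such as Flux.1-dev are typically run with a single conditional pass per step, so there is no unconditional branch for a negative prompt to replace. Test before relying on them.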

Major model families

Understanding what distinguishes each major model family helps you choose the right tool and interpret capability comparisons.

DALL-E 3 (OpenAI)

Architecture: closed source, not publicly documented in detail. Uses diffusion-based generation with tight GPT-4 integration.

Key differentiator — prompt rewriting: DALL-E 3 passes your prompt through GPT-4 before generating. The LLM expands short prompts into detailed descriptions, adds implicit details you didn't specify, and corrects ambiguous language. This is why DALL-E 3 follows even brief prompts so reliably—it's not just reading your words, it's interpreting your intent.

Text rendering: significantly better than earlier models at placing legible, correctly spelled text within images.

Safety: extensive content filtering on inputs and outputs, making it more conservative than open-weights alternatives.

Access: API through OpenAI (also powers ChatGPT image generation), no public model weights.

Best for: general-purpose professional generation where strong prompt following matters and custom fine-tuning is not required.

Stable Diffusion family (Stability AI)

Architecture evolution: SD 1.x/2.x (U-Net + CLIP), SDXL (larger U-Net + dual CLIP), SD3/SD3.5 (transformer + flow matching + T5).

Key differentiator — open weights and ecosystem: any checkpoint can be downloaded, run locally, fine-tuned, extended, and modified. This spawned an enormous ecosystem:

LoRA marketplaces: thousands of style and subject fine-tunes
ControlNet: adds structural control inputs (edge maps, depth maps, body poses) for precise composition guidance
Community tools: ComfyUI (node-based workflow), Automatic1111, Forge, InvokeAI
img2img and inpainting: well-supported pipelines for editing existing images

Best for: custom workflows, production pipelines that need full stack control, specific style fine-tunes, integration with ControlNet for structured outputs.

Flux (Black Forest Labs)

Architecture: transformer-based with flow matching. Three variants: Flux.1-pro (API only, highest quality), Flux.1-dev (open weights, non-commercial, near-pro quality), Flux.1-schnell (open weights, distilled for 4-step inference).

Key differentiators:

State-of-the-art quality: Flux.1 models produced the highest-quality outputs among open-weights systems at their release in mid-2024 and remained competitive through 2026
Text rendering: significantly better than earlier SD models at placing readable, correctly spelled text within images
Flow matching efficiency: 20 steps with Flux.1-dev typically produces excellent results
Open weights: Flux.1-dev available on Hugging Face under a non-commercial license; Flux.1-schnell under Apache 2.0

Best for: highest-quality open-weights generation, text-in-image tasks, production workflows needing state-of-the-art quality without closed-API dependency.

Imagen 3 (Google DeepMind)

Architecture: not publicly documented in detail. Google describes Imagen 3 as a latent diffusion model. The cascaded design (a low-resolution base model followed by separate upsamplers) and the T5-XXL text encoder are documented for the original Imagen from 2022.

Key differentiators:

Long-prompt handling: exceptional at following long, complex, prose-style prompts. You can write in natural language sentences rather than keyword lists.
Cascade lineage: the original Imagen first generated a low-resolution image, then used separate upsampler models to add detail at each scale, similar to DALL-E 2's approach
Strong text rendering: competitive with DALL-E 3 for reading text correctly in images
Native Google integration: built into Gemini image generation, available via Vertex AI

Access: Google API only, no open weights.

Best for: complex multi-clause prompts, within Google's product ecosystem (Gemini, Workspace, Vertex).

Midjourney

Architecture: proprietary, not documented. Likely transformer + diffusion with extensive aesthetic fine-tuning.

Key differentiator — aesthetic training: Midjourney is trained and fine-tuned with heavy emphasis on producing visually striking, beautifully composed images. It tends to stylize, beautify, and add visual drama rather than literally follow prompts.

Behavior vs. other models:

Strong aesthetic opinions: outputs have a distinctive "Midjourney look" even across diverse prompts
Less prompt-literal: it interprets and aestheticizes rather than following literally
No API for integration into custom workflows; Discord bot interface or direct web access only
No fine-tuning or custom model weights for users

Best for: creative, artistic image generation where aesthetic quality and visual impact matter more than precise prompt adherence or pipeline integration.

Practical controls: what every parameter actually does

These are the controls available in virtually every image generation interface, explained mechanistically.

Steps

The number of denoising iterations. More steps means more refinement, up to diminishing returns that vary by model and sampler.

Start with 20 for flow matching models (Flux, SD3) and 25 for DDPM models. Increase to 35 if you see incomplete detail. Going above 50 almost never helps.

CFG scale

Controls how strongly the model follows your prompt. The standard default of 7 is a good starting point. Use lower values (4-6) for more natural, varied results. Use higher values (9-12) when the model ignores key prompt elements.

For Flux: start at 3.5. For SD models: start at 7. These defaults reflect the different training objectives.

Seed

A random integer that initializes the starting noise x_T. The same seed with the same model, settings, and software and hardware stack reproduces the same image. Change the seed for variation; keep it fixed when you want to adjust the prompt while holding the composition stable.
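As a rough illustration of why a fixed seed reproduces an image, the sketch below draws the starting noise from a seeded generator using only Python's standard library. The helper name `starting_noise` and the flat-list "latent" are assumptions made for illustration; real pipelines draw a tensor from a seeded `torch.Generator` instead.

```python
import random

def starting_noise(seed: int, size: int) -> list[float]:
    """Draw `size` standard-Gaussian values from a generator seeded with `seed`."""
    rng = random.Random(seed)  # private generator; global random state is untouched
    return [rng.gauss(0.0, 1.0) for _ in range(size)]

noise_a = starting_noise(seed=42, size=16)
noise_b = starting_noise(seed=42, size=16)  # same seed: identical starting noise
noise_c = starting_noise(seed=43, size=16)  # different seed: different starting noise

print(noise_a == noise_b)
print(noise_a == noise_c)
```

Because every later denoising step is computed from this starting noise, identical noise plus identical settings leads to the same result, while a new seed changes the composition from the first step onward.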

Negative prompts

Text that CFG pushes generation away from. Common patterns:

ugly, blurry, low quality, jpeg artifacts — suppresses image degradation
text, watermark, logo — avoids unwanted overlays
extra fingers, deformed hands, extra limbs — common anatomy fixes for SD 1.x/2.x (less necessary with newer models)
oversaturated, overexposed — tones down excessive CFG effects

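Note for newer models: negative prompts work inconsistently with Flux and SD3. Flux.1-dev, for example, is guidance-distilled: its default pipeline feeds the guidance scale to the model as an input instead of running a separate unconditional pass, so there is no baseline prediction for a negative prompt to replace. Results vary; test before relying on them.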
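In the usual CFG implementation for Stable Diffusion models, the prediction for the negative prompt takes the place of the unconditional prediction, so guidance pushes away from whatever the negative prompt describes. The sketch below shows that substitution on toy numbers; `guided_noise` is a hypothetical helper and the values are made up.

```python
def guided_noise(eps_cond, eps_negative, scale):
    """CFG with a negative prompt: the negative-prompt prediction replaces the
    unconditional baseline, and the result is pushed away from it."""
    return [n + scale * (c - n) for c, n in zip(eps_cond, eps_negative)]

# Toy per-element noise predictions (illustrative values only).
eps_cond = [0.2, -0.1, 0.5]      # prediction given the prompt
eps_negative = [0.1, 0.1, 0.4]   # prediction given the negative prompt

eps_final = guided_noise(eps_cond, eps_negative, scale=7.0)
print(eps_final)
```

Each element of the result sits on the far side of the prompt's prediction from the negative prompt's prediction, which is the sense in which generation is steered away from the negative text.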

Resolution and aspect ratio

Image resolution affects composition, not just pixel count. The model was trained at specific resolutions (e.g., 512×512 for SD 1.x, 1024×1024 for SDXL, various for Flux). Generating at very different aspect ratios or resolutions than training can produce distorted results (repeated objects, strange compositions). Use the model's native resolution as a baseline.
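One common way to pick dimensions for a non-square image is to keep the total pixel count close to the model's native resolution and round each side to a size the model handles cleanly. The sketch below does that arithmetic. The helper name `dims_for_aspect` and the default rounding to multiples of 64 are assumptions; check the model card for the resolutions a given checkpoint actually supports.

```python
import math

def dims_for_aspect(native_side: int, aspect_w: int, aspect_h: int, multiple: int = 64):
    """Return (width, height) with roughly native_side**2 pixels at the requested
    aspect ratio, each side rounded to the nearest multiple of `multiple`."""
    target_pixels = native_side * native_side
    ratio = aspect_w / aspect_h
    height = math.sqrt(target_pixels / ratio)
    width = height * ratio

    def snap(value: float) -> int:
        return max(multiple, round(value / multiple) * multiple)

    return snap(width), snap(height)

print(dims_for_aspect(1024, 1, 1))   # square at native resolution
print(dims_for_aspect(1024, 16, 9))  # widescreen with a similar pixel budget
```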

LoRA and fine-tuning

What LoRA is

LoRA (Low-Rank Adaptation) is a parameter-efficient fine-tuning method. For a weight matrix W of shape (d_out × d_in), LoRA learns two smaller matrices A (d_rank × d_in) and B (d_out × d_rank) where rank is much less than min(d_out, d_in). The update is applied as:

W_effective = W_pretrained + α × (B × A)

where α is the scaling factor controlling LoRA strength.
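To make the savings concrete, the sketch below counts parameters for one weight matrix and applies the update formula to a toy example using nested lists. The helper names and the 4096×4096 layer size are illustrative assumptions, not values from any particular model.

```python
def lora_param_counts(d_out: int, d_in: int, rank: int):
    """Return (full_params, lora_params) for a single weight matrix."""
    full = d_out * d_in
    lora = rank * d_in + d_out * rank  # A is (rank x d_in), B is (d_out x rank)
    return full, lora

def matmul(x, y):
    """Plain nested-list matrix product, enough for a toy example."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]

def effective_weight(w, b, a, alpha):
    """W_effective = W_pretrained + alpha * (B x A)."""
    delta = matmul(b, a)
    return [[w_ij + alpha * d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]

full, lora = lora_param_counts(d_out=4096, d_in=4096, rank=16)
print(full, lora, full // lora)

# Toy 2x3 pretrained weight with a rank-1 update.
W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
B = [[1.0], [2.0]]          # d_out x rank
A = [[0.5, 0.0, -0.5]]      # rank x d_in
W_half = effective_weight(W, B, A, alpha=0.5)
print(W_half)
```

With these assumed sizes, the two LoRA matrices hold 128 times fewer values than the full matrix, which is why LoRA files are so much smaller than full checkpoints. Setting `alpha` to 0 returns the pretrained weights unchanged.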

During training:

The pretrained weights W are frozen
Only A and B are trained, which requires a tiny fraction of the compute of full fine-tuning
A typical LoRA file is 10-100MB vs. gigabytes for the full model

During inference:

Load the base model normally
Load the LoRA file alongside it
Set a strength multiplier (0 = ignore LoRA, 1 = full LoRA strength, higher = exaggerated effect)

LoRA training data requirements

Style LoRA: 20-50 images in the target style, all with similar aesthetic. Train for 1000-2000 steps at a low learning rate. Result: any prompt rendered in that style.

Subject/character LoRA: 10-30 images of the subject from different angles, lighting conditions, and contexts. This is the most common use case—consistent generation of a specific person, character, or product.

Concept LoRA: a few exemplar images plus a training token. Used for concepts the base model handles poorly.

Other fine-tuning approaches

DreamBooth: originally a full fine-tuning method (updates all weights) using 3-30 subject images and a prior preservation loss, which keeps generating ordinary examples of the subject's class during training so the model does not forget other concepts. Later adapted to work with LoRA. More powerful than bare LoRA for unusual subjects.

Textual Inversion: trains only the embedding vector for a new text token, leaving all model weights unchanged. Very lightweight (a few KB), but because the denoising network itself is frozen, it can only capture concepts the base model is already able to render.

Full fine-tuning: updates all model weights on a large dataset. Produces the most capable custom models but requires serious GPU infrastructure. Used by companies building specialized image generation for specific domains (medical imaging, product photography, etc.).

Running a diffusion pipeline in Python

The diffusers library from Hugging Face provides unified APIs across SD models, Flux, and others.

SDXL with DPM++ 2M Karras sampler:

python

from diffusers import DiffusionPipeline, DPMSolverMultistepScheduler
import torch

# Load the pipeline — downloads ~7GB of weights on first run
pipeline = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
    use_safetensors=True,
    variant="fp16",
)

# Swap to DPM++ 2M Karras sampler for better quality
pipeline.scheduler = DPMSolverMultistepScheduler.from_config(
    pipeline.scheduler.config,
    algorithm_type="dpmsolver++",
    use_karras_sigmas=True,
)
pipeline.to("cuda")

result = pipeline(
    prompt=(
        "A photorealistic portrait of a red fox in a misty forest at dawn, "
        "shallow depth of field, golden hour lighting, National Geographic style"
    ),
    negative_prompt="ugly, blurry, low quality, extra fingers, deformed",
    num_inference_steps=25,
    guidance_scale=7.5,
    height=1024,
    width=1024,
    generator=torch.Generator("cuda").manual_seed(42),
)

result.images[0].save("sdxl_output.png")

Flux.1-dev with flow matching:

python

from diffusers import FluxPipeline
import torch

pipeline = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    torch_dtype=torch.bfloat16,
)
# Keep whole sub-models (text encoders, transformer, VAE) on the CPU and move each to the GPU only while it runs, to reduce VRAM use
pipeline.enable_model_cpu_offload()

result = pipeline(
    prompt=(
        "A photorealistic portrait of a red fox in a misty forest at dawn, "
        "shallow depth of field, golden hour lighting, National Geographic style"
    ),
    # Flux doesn't use negative prompts reliably with flow matching
    num_inference_steps=20,   # Flow matching needs fewer steps
    guidance_scale=3.5,       # Flux uses lower CFG scale — don't use 7.5 here
    height=1024,
    width=1024,
    generator=torch.Generator("cpu").manual_seed(42),
)

result.images[0].save("flux_output.png")

Key differences between these two calls:

Flux uses torch.bfloat16 (not float16): bfloat16 keeps float32's exponent range, so large activations in transformer blocks are less likely to overflow
Flux uses guidance_scale=3.5 vs. SDXL's 7.5 — different training means different optimal CFG
Flux uses 20 steps vs. SDXL's 25 — flow matching converges faster
Negative prompts are omitted for Flux — they work inconsistently with flow matching
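One way to see the bfloat16 point is to compare the largest finite value each 16-bit format can hold. The sketch below derives both from their bit layouts (float16 has 5 exponent bits and 10 mantissa bits; bfloat16 has 8 and 7). `max_finite` is a hypothetical helper, and the overflow argument is the usual general reasoning, not something specific to Flux.

```python
def max_finite(exponent_bits: int, mantissa_bits: int) -> float:
    """Largest finite value of an IEEE-style binary floating-point format."""
    bias = 2 ** (exponent_bits - 1) - 1
    # The all-ones exponent code is reserved for inf/NaN, so the top usable code is 2**e - 2.
    max_exponent = (2 ** exponent_bits - 2) - bias
    max_mantissa = 2.0 - 2.0 ** (-mantissa_bits)
    return max_mantissa * 2.0 ** max_exponent

float16_max = max_finite(exponent_bits=5, mantissa_bits=10)
bfloat16_max = max_finite(exponent_bits=8, mantissa_bits=7)

print(float16_max)
print(bfloat16_max)
```

The trade-off is precision: bfloat16 spends its bits on range and keeps fewer mantissa bits than float16.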

Glossary

Term	One-line meaning
DDPM	Denoising Diffusion Probabilistic Model — original 1000-step formulation
Flow matching	Alternative training objective with straight-line noise-to-image paths; fewer inference steps
Forward process	Training-time process of adding noise to real images step by step
Reverse process	Inference-time process of removing noise step by step to generate images
Noise schedule	How much noise is added at each timestep (linear, cosine, flow matching)
Sampler / scheduler	Algorithm for each denoising step (DDIM, DPM++ 2M Karras, Euler A, LCM)
Latent space	Compressed representation where diffusion runs (8× smaller per spatial dimension)
VAE	Variational Autoencoder — encodes images to latent space; decodes latents to pixels
CLIP	Text encoder trained for image-text alignment; 77-token limit
T5	Language-model text encoder; handles long, complex prompts better than CLIP
Cross-attention	Mechanism by which image spatial features attend to text token embeddings
CFG scale	Classifier-free guidance strength — how strongly the model follows the prompt
U-Net	Convolutional encoder-decoder backbone with skip connections; used in SD 1.x–SDXL
DiT	Diffusion Transformer — ViT-style transformer on latent patches; used in SD3, Flux
LoRA	Low-Rank Adaptation — lightweight fine-tuning adapter (10-100MB)
DreamBooth	Fine-tuning technique for learning specific subjects with a prior preservation loss
Negative prompt	Text that CFG pushes generation away from
Seed	Random integer initializing starting noise — same seed = same composition
ControlNet	Extension adding structural control inputs (edges, depth, pose) to SD models
Inpainting	Generating within a masked region of an existing image
img2img	Using an existing image as a starting point rather than pure noise
Mode collapse	GAN failure where the generator produces only a narrow range of outputs
Skip connections	U-Net connections passing encoder outputs directly to corresponding decoder layers

Bottom line

Diffusion models work by learning to reverse a noising process. The core components: a noise schedule that defines how training images are corrupted, a denoising network (U-Net or transformer) that learns to predict and remove noise, a text encoder that converts prompts to embeddings, cross-attention that injects those embeddings at every denoising step, a VAE that moves computation into compressed latent space, and CFG that amplifies prompt adherence at inference.

The architecture has evolved from convolutional U-Nets through transformer DiTs to flow matching, each generation improving quality and efficiency. The model families (DALL-E 3, Stable Diffusion, Flux, Imagen, Midjourney) differ in architecture, training data, openness, and optimization target—closed-API quality vs. open-weights flexibility vs. aesthetic fine-tuning. The practical controls—steps, CFG scale, seed, negative prompts—are direct expressions of the underlying math, not arbitrary sliders.

Model cards, licenses, and technical details change with each release. For production use, always check the model card for the specific checkpoint you are deploying, and verify the recommended steps and CFG settings for that model's training objective.

Why diffusion? The intuition before the math

Before the equations, it helps to build intuition for why this approach works at all.

The ink-in-water analogy

Why not just use GANs?

The forward process: adding noise during training

The forward process is not what happens when you generate an image—it is what happens during training. Understanding it is essential for understanding what the model is actually learning.

What happens mathematically

The math is structured so you can compute x_t (the noisy version at any timestep t) directly from x₀ in a single calculation, without running through all intermediate steps. Concretely:

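x_t = √(ᾱ_t) · x₀ + √(1 - ᾱ_t) · ε

Here ε is standard Gaussian noise, and ᾱ_t is the running product of the per-step factors (1 - β_s) set by the noise schedule, so it shrinks from nearly 1 toward 0 as t grows. The denoising network ε_θ is trained to recover ε from the noisy image x_t, the timestep t, and the conditioning c (the text embedding) by minimizing:

L = E[||ε - ε_θ(x_t, t, c)||²]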

The model is learning: "given this noisy image and this amount of noise, what noise was added?" Running this in reverse lets you go from noise to image.
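The single-jump noising formula can be sketched in a few lines of standard-library Python. The linear β schedule below (1e-4 to 0.02 over 1000 steps) follows a commonly used DDPM default and is an assumption here, as are the helper names and the four-value toy "image".

```python
import math
import random

def linear_alpha_bars(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative products alpha_bar_t for a linear beta schedule (assumed values)."""
    alpha_bars, running = [], 1.0
    for i in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * i / (num_steps - 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars

def add_noise(x0, alpha_bar_t, rng):
    """x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps, in one jump."""
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    x_t = [math.sqrt(alpha_bar_t) * x + math.sqrt(1.0 - alpha_bar_t) * e
           for x, e in zip(x0, eps)]
    return x_t, eps

alpha_bars = linear_alpha_bars()
rng = random.Random(0)
x0 = [0.8, -0.3, 0.1, 0.5]                         # toy image values in [-1, 1]
x_early, _ = add_noise(x0, alpha_bars[9], rng)     # t = 10: still mostly signal
x_late, eps = add_noise(x0, alpha_bars[-1], rng)   # t = 1000: almost pure noise

print(alpha_bars[9], alpha_bars[-1])
```

During training, the pair (`x_late`, `eps`) is exactly what the loss needs: the network sees the noisy input and is scored on how closely its prediction matches `eps`.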

Noise schedules: how much noise at each step

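The noise schedule determines how quickly noise accumulates across the T timesteps. Common choices such as linear and cosine schedules distribute that noise very differently, which changes how much of the original signal survives at intermediate timesteps.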
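To get a feel for how much schedules can differ, the sketch below compares the surviving signal fraction ᾱ_t under a linear β schedule and under the cosine schedule. The constants (β from 1e-4 to 0.02, cosine offset s = 0.008, T = 1000) are widely used defaults assumed here, not values taken from this article, and the function names are illustrative.

```python
import math

def linear_alpha_bar(t, num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """alpha_bar after t steps of a linear beta schedule (assumed constants)."""
    alpha_bar = 1.0
    for i in range(t):
        beta = beta_start + (beta_end - beta_start) * i / (num_steps - 1)
        alpha_bar *= 1.0 - beta
    return alpha_bar

def cosine_alpha_bar(t, num_steps=1000, s=0.008):
    """alpha_bar for the cosine schedule: f(t) / f(0) with a squared-cosine f."""
    def f(u):
        return math.cos((u / num_steps + s) / (1.0 + s) * math.pi / 2.0) ** 2
    return f(t) / f(0)

for t in (250, 500, 750):
    print(t, linear_alpha_bar(t), cosine_alpha_bar(t))
```

With these constants, the cosine schedule keeps noticeably more signal in the middle of the process, while the linear schedule has destroyed most of it by the halfway point.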

The reverse process: generating images at inference

When you generate an image, the forward process doesn't happen. You start from pure Gaussian noise x_T and run the reverse process to produce x₀—your generated image.

The denoising loop

The basic generation loop:

1. Sample x_T from a standard Gaussian distribution (pure random noise).
2. For t = T, T−1, ..., 1:
   a. Predict the noise: ε̂ = ε_θ(x_t, t, text_embedding)
   b. Estimate the clean image x₀ from x_t and ε̂
   c. Compute x_{t−1} by taking a step toward the estimated x₀ (with appropriate noise depending on the sampler)
3. Return x₀ as the generated image.

Step 2c is where different samplers diverge. The mathematical form of "take a step toward x₀" varies, and this is what drives the practical differences between sampling algorithms.
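The loop can be sketched end to end with a stand-in for the trained network. In the toy below, `oracle_noise_predictor` plays the role of ε_θ by pretending the only plausible clean image is a known `target`, the schedule `alpha_bar` is a made-up linear ramp, and step 2c uses the deterministic DDIM-style update (no fresh noise added). All of these are assumptions chosen so the example runs without a model.

```python
import math
import random

def alpha_bar(t, num_steps):
    """Toy schedule: 1.0 at t = 0, falling linearly to a small value at t = T."""
    return 1.0 - t / (num_steps + 1)

def oracle_noise_predictor(x_t, ab_t, target):
    """Stand-in for eps_theta: the noise that would turn `target` into x_t."""
    return [(x - math.sqrt(ab_t) * x0) / math.sqrt(1.0 - ab_t)
            for x, x0 in zip(x_t, target)]

def sample(target, num_steps=20, seed=0):
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in target]           # step 1: x_T is pure noise
    for t in range(num_steps, 0, -1):                   # step 2: t = T, ..., 1
        ab_t, ab_prev = alpha_bar(t, num_steps), alpha_bar(t - 1, num_steps)
        eps_hat = oracle_noise_predictor(x, ab_t, target)               # step 2a
        x0_hat = [(xi - math.sqrt(1.0 - ab_t) * e) / math.sqrt(ab_t)    # step 2b
                  for xi, e in zip(x, eps_hat)]
        x = [math.sqrt(ab_prev) * x0 + math.sqrt(1.0 - ab_prev) * e     # step 2c
             for x0, e in zip(x0_hat, eps_hat)]
    return x                                            # step 3: this is x_0

target = [0.8, -0.3, 0.1, 0.5]
result = sample(target, num_steps=20, seed=0)
print(result)
```

With a perfect predictor the loop lands on the target regardless of the starting noise. A real network only approximates ε, which is why the choice of update rule in step 2c and the number of steps affect quality.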

Samplers: the algorithms that govern each step

The sampler (or scheduler) defines the specific update rule for going from x_t to x_{t−1}. Understanding the main options helps you choose the right one for your use case.

Step count guidelines

Steps	Quality	Use case
4-8	Rough but recognizable	Real-time preview, rapid iteration
15-20	Good quality with modern samplers	Most production tasks with DPM++ or flow matching
25-35	High quality	Fine-tuned outputs, high-stakes generation
40-50	Marginal improvement	Specific cases where 30 steps has artifacts
1000 (DDPM)	Historical baseline	Research only

The right number depends on the sampler and architecture. With flow matching models (Flux, SD3), 20 steps is typically excellent. With DDPM-based models and DPM++ 2M Karras, 25-30 is a safe default.

How text conditioning works

Text-to-image means the model generates images corresponding to a text prompt. For this to work, the text must influence every denoising step—not just the initialization.

Text encoders: converting words to vectors

Cross-attention: the mechanism that joins text to image

Once you have text embeddings (a sequence of vectors, one per token), they must influence the denoising network at every step. The mechanism is cross-attention.

In a cross-attention layer inside the denoising network:

Image features at each spatial location form the queries: "what am I looking at and what do I need?"
Text token embeddings form the keys and values: "what information is available from the prompt?"
Attention weights determine how much each image location should attend to each text token
The output is image features updated to reflect relevant textual context
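The four points above map directly onto scaled dot-product attention. The sketch below is a minimal single-head version on nested lists. It skips the learned projection matrices that real layers apply to produce queries, keys, and values, and the vectors and token labels are made up for illustration.

```python
import math

def softmax(scores):
    peak = max(scores)                        # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys, values):
    """queries: one vector per image location; keys/values: one vector per text token."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    outputs, all_weights = [], []
    for q in queries:
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
        weights = softmax(scores)             # how much this location attends to each token
        mixed = [sum(w * v[d] for w, v in zip(weights, values))
                 for d in range(len(values[0]))]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights

# Two image locations attending over three text tokens ("fox", "forest", "dawn").
queries = [[4.0, 0.0], [0.0, 4.0]]
keys = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
values = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

outputs, weights = cross_attention(queries, keys, values)
print(weights[0])
print(outputs[0])
```

The first location's query lines up with the first token's key, so most of its attention weight, and therefore most of its updated feature, comes from that token.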

Why prompt length and word order matter

Behavior with long vs. short prompts depends heavily on the text encoder:

CLIP encoders have a 77-token limit (including special tokens). Anything beyond is truncated. The most important terms should appear early in the prompt.
T5-based models handle much longer prompts (256+ tokens) and can reason about complex multi-part descriptions. Writing almost in prose sentences works.
Word order matters for attention: terms earlier in the prompt tend to receive higher attention weights in practice, especially with CLIP.
Emphasis syntax ((term:1.3) in Stable Diffusion UIs) artificially boosts the attention weight contribution for specific tokens—useful when the model underweights a key concept.

Latent diffusion and the VAE

The variational autoencoder (VAE)

The VAE architecture:

Encoder: takes a high-resolution image (e.g., 512×512×3 pixels) and compresses it to a small latent tensor (e.g., 64×64×4 values)
Decoder: takes a latent tensor and expands it back to full-resolution pixels

The 8× spatial downsampling factor (used in SD 1.x through SDXL) is standard: 512×512 pixels become 64×64 latents, 1024×1024 pixels become 128×128 latents.
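The shape arithmetic is easy to check in a few lines. This is a minimal sketch; `latent_shape` and `compression_ratio` are illustrative helper names, not part of any library.

```python
def latent_shape(height: int, width: int, channels: int = 4, factor: int = 8):
    """Latent tensor shape (H, W, C) for an image of the given pixel size."""
    if height % factor or width % factor:
        raise ValueError(f"height and width must be multiples of {factor}")
    return (height // factor, width // factor, channels)

def compression_ratio(height: int, width: int, channels: int = 4, factor: int = 8) -> float:
    """How many RGB pixel values each latent value stands in for."""
    h, w, c = latent_shape(height, width, channels, factor)
    return (height * width * 3) / (h * w * c)

print(latent_shape(512, 512))        # SD 1.x-sized image
print(latent_shape(1024, 1024))      # SDXL-sized image
print(compression_ratio(512, 512))   # values in, per value out
```

The diffusion model therefore works on a tensor dozens of times smaller than the pixel image, which is where most of the compute savings of latent diffusion come from.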

The VAE is trained separately from the diffusion model, using:

Reconstruction loss: the decoded image should match the original pixel for pixel
Perceptual loss: the decoded image should be perceptually similar (matching activations in a pre-trained vision network, not just pixel values)
KL divergence term: the latent distribution should be smooth and continuous so the decoder can handle novel latents from the diffusion process

What latent space actually is

A well-trained VAE latent space has three properties that matter for diffusion:

Compact: representing images with far fewer numbers
Smooth: nearby points in latent space decode to visually similar images; there are no sharp discontinuities
Organized: the encoder learns to cluster related concepts spatially—not in labeled, human-interpretable regions, but in a statistically organized way that the decoder can navigate

Why the 8× factor is standard

The 8× spatial compression (with 4 or 16 channels) was found empirically to balance:

Computational efficiency: 64×64 latents have 64× fewer spatial positions than 512×512 pixels
Reconstruction quality: the VAE can still reconstruct fine detail at 8× compression; 16× compression loses too much
Latent expressiveness: enough spatial resolution to represent complex compositional arrangements

Flux and SD3 use 16-channel latents at the same 8× spatial compression, allowing more information per spatial position without changing the compute profile drastically.

Classifier-free guidance (CFG)

The training trick

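During training, the text conditioning is randomly dropped (replaced with an empty prompt) for a fraction of examples, so a single model learns both:

Conditional generation: how to generate images that match a text description
Unconditional generation: how to generate plausible images with no text guidance at all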

The dropout doesn't hurt—it actually regularizes the model and ensures it can generate coherently both with and without conditioning.

The sampling trick

At inference, for every denoising step, the model makes two noise predictions:

Conditional: the noise prediction given your actual text prompt
Unconditional: the noise prediction given null/empty conditioning

These are combined using the CFG formula:

ε_final = ε_unconditional + scale × (ε_conditional − ε_unconditional)

Substituting values:

When scale = 1: ε_final = ε_conditional — pure conditional prediction, no CFG amplification
When scale = 7: ε_final = ε_unconditional + 7 × (ε_conditional − ε_unconditional), which lands 6× the conditional-minus-unconditional difference beyond the conditional prediction itself
When scale = 0: ε_final = ε_unconditional — completely ignores your prompt

The formula extrapolates beyond the conditional prediction, pulling the generated image strongly in the direction your prompt implies.
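The three cases can be checked numerically. The sketch below applies the formula element-wise to toy predictions; `cfg_combine` is a hypothetical helper and the numbers are made up.

```python
def cfg_combine(eps_uncond, eps_cond, scale):
    """eps_final = eps_uncond + scale * (eps_cond - eps_uncond), element-wise."""
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]

eps_uncond = [0.0, 0.5, -1.0]
eps_cond = [1.0, 0.25, -0.5]

at_zero = cfg_combine(eps_uncond, eps_cond, scale=0.0)   # equals eps_uncond
at_one = cfg_combine(eps_uncond, eps_cond, scale=1.0)    # equals eps_cond
at_seven = cfg_combine(eps_uncond, eps_cond, scale=7.0)  # extrapolated past eps_cond

# Same result written as "start at the conditional prediction and keep going".
extrapolated = [c + 6.0 * (c - u) for u, c in zip(eps_uncond, eps_cond)]

print(at_zero, at_one, at_seven)
print(at_seven == extrapolated)
```

Writing the scale-7 case as the conditional prediction plus six more copies of the difference makes the extrapolation explicit: every unit of scale above 1 pushes the result further past what the prompt alone would predict.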

What CFG scale does to your images

CFG Scale	Effect	When to use
1-3	Very varied, sometimes ignores prompt	Creative exploration, background generation
4-6	Moderate adherence, naturalistic results	Soft styling, organic-looking images
7-8	Standard — good prompt following	Most generation tasks
9-12	Strong adherence, higher contrast, sharper	When prompt compliance is critical
13+	Oversaturation, artifacts common	Rarely beneficial

Architecture evolution: U-Net, DiT, and flow matching

The denoising network's architecture has evolved substantially since the original DDPM paper. Each generation brought meaningful improvements in quality, scalability, and efficiency.

U-Net era (2021-2023): SD 1.x through SDXL

The U-Net is a convolutional neural network with an hourglass shape:

Encoder path (downsampling): series of convolution + downsampling blocks that progressively reduce spatial resolution while increasing channel depth. Captures increasingly abstract, global features.
Bottleneck: the lowest-resolution, highest-channel representation where global image understanding lives.
Decoder path (upsampling): series of convolution + upsampling blocks that restore spatial resolution, recovering fine-grained detail.
Skip connections: outputs from corresponding encoder layers are concatenated with decoder inputs, allowing high-resolution spatial detail to bypass the bottleneck directly.

Stable Diffusion 1.4/1.5, SD 2.x, and SDXL all use U-Net backbones. SDXL also ships an optional refiner, a second U-Net that runs on the latent after the base model to add fine detail.

DiT era (2022-2024): diffusion transformers

The DiT (Diffusion Transformer) architecture, introduced by Peebles and Xie in late 2022, replaces the convolutional U-Net with a Vision Transformer-style architecture:

The latent tensor is divided into small patches (e.g., 2×2 patches of the 64×64 latent produce a 32×32 grid of tokens, plus time and text tokens)
Each patch becomes a token in a sequence
Standard transformer blocks (multi-head self-attention + feedforward) process the full sequence
Timestep and conditioning are injected via adaptive layer norm (adaLN-Zero)
Final output is reshaped back into the latent tensor shape
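The patching step is just a reshape, which the sketch below spells out on nested lists. `patchify` is an illustrative helper; a real DiT would follow it with a learned linear projection of each token to the model's hidden size, which is omitted here.

```python
def patchify(latent, patch):
    """Split an H x W x C latent (nested lists) into a sequence of tokens,
    each holding the patch * patch * C values of one patch."""
    height, width = len(latent), len(latent[0])
    tokens = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            token = [value
                     for row in latent[top:top + patch]
                     for cell in row[left:left + patch]
                     for value in cell]
            tokens.append(token)
    return tokens

latent = [[[0.0] * 4 for _ in range(64)] for _ in range(64)]  # 64 x 64 x 4 latent
tokens = patchify(latent, patch=2)

print(len(tokens), len(tokens[0]))
```

A 64×64 latent with 2×2 patches gives a 32×32 grid, so the transformer processes 1,024 image tokens. Self-attention cost grows with the square of that count, which is why patch size is an important efficiency setting.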

Why DiT improves on U-Net:

Global attention: self-attention allows any patch to directly attend to any other patch at any depth, not just nearby convolution neighbors. Global spatial relationships become learnable.
Scaling laws: transformers scale more predictably than CNNs with more compute, larger models, and more data. Scaling up a DiT reliably improves quality.
Architectural simplicity: no encoder-decoder asymmetry, no skip connections required.

The quantitative result (from the original DiT paper): larger DiT models achieve lower FID scores on image generation benchmarks with predictable, smooth scaling curves.

Stable Diffusion 3, SD 3.5, and Flux use transformer-based backbones with some architectural variations specific to each.

Flow matching era (2023-2026): Flux and SD3

The training loss for flow matching:

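L = E[||v_θ(x_t, t, c) − (x_1 − x_0)||²]

Here x_0 and x_1 are the two endpoints of the path (one is a training image, the other a Gaussian noise sample; which is labeled 0 and which 1 depends on the convention), and x_t = (1 − t) · x_0 + t · x_1 is the point a fraction t of the way along the straight line between them. The network v_θ learns the constant velocity x_1 − x_0 along that line.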
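The sketch below builds one training pair for this loss and then shows why straight paths are cheap to follow at inference. It treats x_0 as the start of the path and x_1 as the end, whichever of the two is noise in a given convention. The helper names are illustrative, and the "oracle" velocity is an idealization: a trained network only approximates it, and its learned paths are not perfectly straight.

```python
def interpolate(x0, x1, t):
    """Point a fraction t of the way along the straight line from x0 to x1."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]

def velocity_target(x0, x1):
    """Regression target for v_theta: the constant velocity x1 - x0."""
    return [b - a for a, b in zip(x0, x1)]

def euler_integrate(x_start, velocity_fn, num_steps):
    """Follow dx/dt = velocity_fn(x, t) from t = 0 to t = 1 with Euler steps."""
    x, dt = list(x_start), 1.0 / num_steps
    for i in range(num_steps):
        v = velocity_fn(x, i * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

x0 = [0.0, 1.0, -2.0]
x1 = [1.0, -1.0, 0.5]

x_mid = interpolate(x0, x1, t=0.5)   # network input for a training example at t = 0.5
target = velocity_target(x0, x1)     # what v_theta(x_mid, 0.5, c) should output

def oracle(x, t):
    return target                    # idealized, perfectly learned velocity

one_step = euler_integrate(x0, oracle, num_steps=1)
many_steps = euler_integrate(x0, oracle, num_steps=20)
print(one_step, many_steps)
```

When the velocity really is constant along the path, one Euler step and twenty reach the same endpoint. That is the intuition behind flow matching models needing fewer steps; in practice the learned field is only approximately straight, so a moderate step count is still used.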

Practical differences from DDPM:

Fewer inference steps: 20 steps is typically sufficient for Flux.1-dev vs. 25-30 for DPM++ on DDPM models
More uniform training signal: DDPM's loss concentrates too much on high-noise timesteps; flow matching distributes learning more evenly
Better with large models: the simpler objective scales more cleanly

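Limitation: some flow matching models handle negative prompts inconsistently. This comes less from flow matching itself than from how a given checkpoint was trained or distilled: Flux.1-dev, for example, is guidance-distilled and takes the guidance scale as an input rather than running the separate unconditional pass that negative prompts rely on.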

Major model families

Understanding what distinguishes each major model family helps you choose the right tool and interpret capability comparisons.

DALL-E 3 (OpenAI)

Architecture: closed source, not publicly documented in detail. Uses diffusion-based generation with tight GPT-4 integration: the language model rewrites and expands prompts before they reach the image model.

Text rendering: significantly better than earlier models at placing legible, correctly spelled text within images.

Safety: extensive content filtering on inputs and outputs, making it more conservative than open-weights alternatives.

Access: API through OpenAI (also powers ChatGPT image generation), no public model weights.

Best for: general-purpose professional generation where strong prompt following matters and custom fine-tuning is not required.

Stable Diffusion family (Stability AI)

Architecture evolution: SD 1.x/2.x (U-Net + CLIP), SDXL (larger U-Net + dual CLIP), SD3/SD3.5 (transformer + flow matching + T5).

Key differentiator — open weights and ecosystem: any checkpoint can be downloaded, run locally, fine-tuned, extended, and modified. This spawned an enormous ecosystem:

LoRA marketplaces: thousands of style and subject fine-tunes
ControlNet: adds structural control inputs (edge maps, depth maps, body poses) for precise composition guidance
Community tools: ComfyUI (node-based workflow), Automatic1111, Forge, InvokeAI
img2img and inpainting: well-supported pipelines for editing existing images

Best for: custom workflows, production pipelines that need full stack control, specific style fine-tunes, integration with ControlNet for structured outputs.

Flux (Black Forest Labs)

Key differentiators:

State-of-the-art quality: Flux.1 models produced the highest-quality outputs among open-weights systems at their release in mid-2024 and remained competitive through 2026
Text rendering: significantly better than earlier SD models at placing readable, correctly spelled text within images
Flow matching efficiency: 20 steps with Flux.1-dev typically produces excellent results
Open weights: Flux.1-dev available on Hugging Face under a non-commercial license; Flux.1-schnell under Apache 2.0

Best for: highest-quality open-weights generation, text-in-image tasks, production workflows needing state-of-the-art quality without closed-API dependency.

