How do image generation models work? Diffusion, latents, and the keywords to read the papers

Modern image AIs (DALL·E, Stable Diffusion, Imagen, FLUX) usually train a model to turn noise into images, conditioned on text. Here is the pipeline in plain terms—plus a visual strip from static noise to a clear picture—and a glossary of terms you will see in docs.

4 min read · Yash Thakker
Tags: Diffusion models, Image generation, Stable Diffusion, DALL-E, FLUX, Imagen, Generative AI, deep learning

Most open-weights and many API image products in the 2020s follow one broad recipe: start from random noise, then run a neural network many times in sequence to remove noise and form a coherent image, conditioned on a text prompt. The method family is denoising diffusion. Vendors brand it (DALL·E, Stable Diffusion, Imagen, FLUX); the outer loop is similar while encoders, backbones, and licenses differ.

The figure below is illustrative—not an exact frame-by-frame trace of any one commercial scheduler—but it matches the user-facing idea: static noise → emerging structure → detail → a sharp image.

From noise to image: the intuitive steps of an iterative denoising generator


The core loop: forward (training) vs reverse (sampling)

Forward process (intuition only): take a real image x₀, add Gaussian noise in T small steps until you obtain x_T, almost indistinguishable from television static. Training teaches the network to predict the noise (or a related score) at each step so the reverse process is learnable.
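A minimal sketch of the forward process, assuming a linear noise schedule with T = 1000 and betas from 1e-4 to 0.02 (illustrative values, not taken from this article). It relies on the standard identity that T small Gaussian steps collapse into one closed-form jump from x₀ to x_t. The function names are hypothetical, and images are stood in for by short lists of floats.

```python
import math
import random


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Per-step noise variances, evenly spaced (the endpoints are assumptions)."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alphas(betas):
    """alpha_bar[t] = product of (1 - beta) over steps 0..t (fraction of signal left)."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def add_noise(x0, t, alpha_bars, rng):
    """Jump from x_0 straight to x_t; also return the noise that was mixed in."""
    a = alpha_bars[t]
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    x_t = [math.sqrt(a) * x + math.sqrt(1.0 - a) * e for x, e in zip(x0, eps)]
    return x_t, eps


def noise_prediction_loss(predicted_eps, true_eps):
    """Mean squared error between predicted and actual noise (the training target)."""
    return sum((p - e) ** 2 for p, e in zip(predicted_eps, true_eps)) / len(true_eps)


alpha_bars = cumulative_alphas(linear_beta_schedule(1000))
x_t, eps = add_noise([1.0, -1.0, 0.5], t=999, alpha_bars=alpha_bars, rng=random.Random(0))
print("signal fraction left at the last step:", alpha_bars[-1])
```

During training, a network would receive `x_t` and `t`, output a noise estimate, and be scored with `noise_prediction_loss` against `eps`.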

Reverse process (what “generate” does): sample x_T from pure noise. For t = T, T−1, …, 1, run the denoiser so that each step removes a little randomness, using the text embedding (and sometimes masks, class labels, or other controls) at every step.
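The reverse loop can be sketched as follows. This is a toy under loud assumptions: `toy_denoiser` is a hypothetical stand-in for a trained network, and it pretends the "prompt" pins down a single target vector, so the ideal noise prediction has a closed form. The update rule is the standard DDPM ancestral step; the schedule values are illustrative.

```python
import math
import random


def make_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linear betas plus their cumulative products (values are assumptions)."""
    betas = [beta_start + (beta_end - beta_start) * i / (num_steps - 1)
             for i in range(num_steps)]
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return betas, alpha_bars


def toy_denoiser(x_t, t, condition, alpha_bars):
    """Hypothetical 'network': exact noise estimate when all data equals `condition`."""
    a = alpha_bars[t]
    return [(x - math.sqrt(a) * c) / math.sqrt(1.0 - a) for x, c in zip(x_t, condition)]


def sample(condition, num_steps=1000, seed=0):
    """Start from pure noise and walk t = T-1 ... 0, conditioning at every step."""
    rng = random.Random(seed)
    betas, alpha_bars = make_schedule(num_steps)
    x = [rng.gauss(0.0, 1.0) for _ in condition]  # x_T: pure Gaussian noise
    for t in reversed(range(num_steps)):
        eps_hat = toy_denoiser(x, t, condition, alpha_bars)
        coef = betas[t] / math.sqrt(1.0 - alpha_bars[t])
        mean = [(xi - coef * e) / math.sqrt(1.0 - betas[t]) for xi, e in zip(x, eps_hat)]
        if t > 0:
            # Re-inject a little fresh noise on every step except the last.
            sigma = math.sqrt(betas[t])
            x = [m + sigma * rng.gauss(0.0, 1.0) for m in mean]
        else:
            x = mean
    return x


print(sample([0.5, -0.25, 1.0]))
```

In a real pipeline the denoiser is a large neural network and `condition` is a sequence of text embeddings rather than the answer itself; only the outer loop carries over.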

In practice, a scheduler / sampler rule chooses the step sizes and how x_t is updated at each step. Quality-oriented runs may use many steps; faster samplers and distilled models cut the step count, trading some fidelity for speed.
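To make the sampler idea concrete, here is a hedged sketch of a deterministic DDIM-style update that visits only 20 of 1000 training timesteps. The schedule values, the even-stride timestep choice, and the helper names (`pick_timesteps`, `make_toy_denoiser`) are assumptions for illustration; real libraries differ in how they space timesteps.

```python
import math
import random


def alpha_bar_table(num_train_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative signal fractions for a linear beta schedule (assumed values)."""
    table, running = [], 1.0
    for i in range(num_train_steps):
        beta = beta_start + (beta_end - beta_start) * i / (num_train_steps - 1)
        running *= 1.0 - beta
        table.append(running)
    return table


def pick_timesteps(num_train_steps, num_inference_steps):
    """Evenly strided subset of training timesteps, from noisiest to cleanest."""
    if not 0 < num_inference_steps <= num_train_steps:
        raise ValueError("num_inference_steps must be in 1..num_train_steps")
    stride = num_train_steps // num_inference_steps
    return list(range(num_train_steps - 1, -1, -stride))[:num_inference_steps]


def make_toy_denoiser(target, alpha_bars):
    """Hypothetical stand-in for a trained network (all data equals `target`)."""
    def denoiser(x_t, t):
        a = alpha_bars[t]
        return [(x - math.sqrt(a) * c) / math.sqrt(1.0 - a) for x, c in zip(x_t, target)]
    return denoiser


def ddim_sample(denoiser, size, alpha_bars, timesteps, rng):
    """Deterministic DDIM-style loop: estimate x_0, then re-noise to the next level."""
    x = [rng.gauss(0.0, 1.0) for _ in range(size)]
    for i, t in enumerate(timesteps):
        a_t = alpha_bars[t]
        # After the last listed timestep we land on the clean image (alpha_bar = 1).
        a_prev = alpha_bars[timesteps[i + 1]] if i + 1 < len(timesteps) else 1.0
        eps_hat = denoiser(x, t)
        x0_pred = [(xi - math.sqrt(1.0 - a_t) * e) / math.sqrt(a_t)
                   for xi, e in zip(x, eps_hat)]
        x = [math.sqrt(a_prev) * p + math.sqrt(1.0 - a_prev) * e
             for p, e in zip(x0_pred, eps_hat)]
    return x


alpha_bars = alpha_bar_table()
steps = pick_timesteps(1000, 20)
target = [0.5, -0.25, 1.0]
print(ddim_sample(make_toy_denoiser(target, alpha_bars), len(target), alpha_bars, steps,
                  random.Random(0)))
```

With a perfect toy denoiser the step count does not change the result; with a real network, fewer steps mean larger jumps and more accumulated error, which is where the speed and quality tradeoff comes from.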


Where the text enters: text encoders and conditioning

Text-to-image pipelines include a text encoder, typically a Transformer such as a CLIP-style text model, a T5-class encoder, or a large language model for long prompts. Its output is a sequence of vectors, one per token, that the image backbone conditions on, often through cross-attention (image feature maps attend to text tokens).
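A minimal single-head cross-attention sketch in plain Python, assuming the learned query, key, and value projections have already been applied (real backbones use trained projection matrices and many heads). Each image location forms a weighted average of text-token vectors, with weights from scaled dot products.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(image_queries, text_keys, text_values):
    """For each image query, attend over all text tokens.

    image_queries: one vector per image location
    text_keys, text_values: one vector per text token
    Returns (outputs, weights), where weights[i][j] is how strongly
    image location i attends to text token j.
    """
    dim = len(text_keys[0])
    outputs, weights = [], []
    for query in image_queries:
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in text_keys]
        row = softmax(scores)
        mixed = [sum(w * value[d] for w, value in zip(row, text_values))
                 for d in range(len(text_values[0]))]
        outputs.append(mixed)
        weights.append(row)
    return outputs, weights


# Two text tokens and three image locations (toy numbers).
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0]]
queries = [[5.0, 0.0], [0.0, 5.0], [0.0, 0.0]]
outputs, weights = cross_attention(queries, keys, values)
print(weights)
```

The first location leans on token 0, the second on token 1, and the all-zero query spreads its attention evenly.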

Product names (DALL·E 3, Imagen, SDXL, FLUX, …) hide different weights and data; the pattern is semantics in, pixels or latents out.


Latent diffusion and the VAE (why not denoise 4K RGB directly?)

Denoising every pixel at full resolution is expensive. Latent diffusion (central to much of the Stable Diffusion line) first uses a VAE to encode an image to a smaller latent grid, runs the denoising network on that tensor, then decodes to RGB. Related keywords: reconstruction loss, latent space, multiscale decoders in some papers.
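A quick back-of-the-envelope calculation shows why the latent route is cheaper. The numbers below assume an 8x spatial downsampling factor and 4 latent channels, a configuration commonly cited for early Stable Diffusion checkpoints; other models use different factors, so treat these as illustrative.

```python
def tensor_elements(height, width, channels):
    """Number of scalar values in a height x width x channels tensor."""
    return height * width * channels


def latent_shape(height, width, downsample=8, latent_channels=4):
    """Latent grid shape for a given image size (factor and channels are assumptions)."""
    if height % downsample or width % downsample:
        raise ValueError("image size must be divisible by the downsampling factor")
    return height // downsample, width // downsample, latent_channels


pixels = tensor_elements(512, 512, 3)
latents = tensor_elements(*latent_shape(512, 512))
print(pixels, latents, pixels / latents)
```

Under these assumptions the denoiser works on 48 times fewer values per step than it would in pixel space, and that saving repeats at every sampling step.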


U-Net vs diffusion transformer (DiT)

  • U-Net — Convolutional “hourglass” with skip connections; the classic backbone in many SD-era systems and SDXL.
  • DiT (diffusion transformer) — Transformer blocks on patches in latent space; same outer sampling story, different inner operator and scaling.
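The "patches in latent space" idea can be sketched as a small patchify step: cut the latent grid into non-overlapping squares and flatten each one into a token for the Transformer. The function name and the nested-list layout (rows, columns, channels) are assumptions for illustration; real implementations use tensor reshapes plus a learned linear projection.

```python
def patchify(latent, patch_size):
    """Split a height x width x channels grid into flattened, non-overlapping patches."""
    height, width = len(latent), len(latent[0])
    if height % patch_size or width % patch_size:
        raise ValueError("grid size must be divisible by patch_size")
    tokens = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            token = []
            for row in range(top, top + patch_size):
                for col in range(left, left + patch_size):
                    token.extend(latent[row][col])  # append this cell's channels
            tokens.append(token)
    return tokens


# A 4 x 4 grid with 2 channels per cell; each cell stores its own (row, col).
latent = [[[row, col] for col in range(4)] for row in range(4)]
tokens = patchify(latent, patch_size=2)
print(len(tokens), len(tokens[0]))
```

The sequence length is (height / patch_size) × (width / patch_size), so smaller patches give the Transformer more tokens to attend over and cost more compute.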

Practical takeaway: API knobs (step count, guidance scale, resolution) matter as much as the architecture name.


Glossary: keywords

| Term | One-line meaning |
| --- | --- |
| DDPM / score-based | Denoising diffusion probabilistic model or related score matching; learn p(x) via a noise schedule. |
| Latent diffusion (LDM) | Diffusion in a VAE latent grid instead of full-resolution pixels. |
| CFG (classifier-free guidance) | At sample time, mix conditional and unconditional predictions to pull samples toward the prompt; the scale is a user-tunable strength knob. |
| Scheduler / sampler | How each denoising step is taken (DDIM, DPM++, Euler, …—naming varies by implementation). |
| Text encoder | Frozen or co-trained model that embeds the prompt. |
| Cross-attention | Image features attend to text token vectors. |
| U-Net | Conv backbone used in many latent diffusion systems. |
| DiT | Diffusion transformer on latent patches. |
| Inpainting / outpainting | Condition on a mask to fill a region or extend the canvas. |
| LoRA | Low-rank adapters (and cousins) for cheap style or subject tuning. |
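The CFG entry above boils down to one line of arithmetic per denoising step. A minimal sketch, assuming the model has already produced two noise predictions for the same x_t, one with the prompt and one with an empty prompt (the function name is hypothetical):

```python
def classifier_free_guidance(eps_uncond, eps_cond, guidance_scale):
    """Move from the unconditional prediction toward (and past) the conditional one."""
    return [u + guidance_scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


eps_uncond = [0.0, 1.0]
eps_cond = [1.0, 1.0]
for scale in (0.0, 1.0, 2.0):
    print(scale, classifier_free_guidance(eps_uncond, eps_cond, scale))
```

A scale of 0 ignores the prompt, 1 reproduces the plain conditional prediction, and values above 1 extrapolate beyond it, which is why higher settings follow the prompt more literally.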

Product map (vocabulary, not a recommendation)

  • OpenAI DALL·E — Closed-weight end-to-end product; strong emphasis on prompt robustness and safety layers.
  • Stability / Stable Diffusion — Open weights and broad community tooling (ControlNet, img2img, regional prompts, …).
  • Google Imagen — T5- or similar text encoders plus diffusion in Google’s stacks (see each generation’s paper / card).
  • FLUX (e.g. Black Forest Labs and partners) — Recent high-fidelity lines; some open checkpoints and some API-only.

Use each vendor’s model card, license, and safety rules for the exact weights or API you run.


This article is a conceptual map. For deployment, use the model card, license, and safety documentation for your checkpoint.
