If you've followed AI developments in 2025-2026, you've likely encountered the term world models—systems that don't just process language but learn to simulate reality itself. From Odyssey's Starchild-1 generating synchronized audio and video in real-time to Google's Genie 2 creating playable 3D environments from single images, world models represent a fundamental shift in what AI systems can do.
This guide explains what world models are, how they work, why they matter, and profiles the leading systems defining this space.
What is a world model?
A world model is a neural network that learns the dynamics of the physical world (physics, spatial properties, object permanence, and causality) and can simulate how environments evolve over time.
The core idea: rather than simply memorizing patterns, world models learn internal representations of how the world works. Given current observations (video frames, sensor data, images) and potential actions, they predict what happens next.
This mirrors how humans navigate the world. When you see a ball thrown, you don't recalculate physics from first principles; your brain has an internal model that instantly predicts where the ball will land. World models aim to give machines a similar capability.
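The thrown-ball intuition can be made concrete with a deliberately tiny sketch. It assumes a one-dimensional ball under constant acceleration and noise-free observations, and it "learns" the single parameter of its world (the acceleration) from past heights before predicting forward. The function names and the synthetic throw are illustrative assumptions, not taken from any system covered in this guide; real world models learn far richer dynamics with neural networks.

```python
def fit_acceleration(heights, dt):
    """Estimate a constant acceleration from evenly spaced height samples
    using the average second finite difference."""
    diffs = [
        (heights[i + 1] - 2 * heights[i] + heights[i - 1]) / dt ** 2
        for i in range(1, len(heights) - 1)
    ]
    return sum(diffs) / len(diffs)


def predict_landing_time(heights, dt, max_steps=10_000):
    """Roll the fitted model forward until the ball reaches the ground."""
    accel = fit_acceleration(heights, dt)
    # Velocity at the last observation (exact for constant acceleration).
    velocity = (heights[-1] - heights[-2]) / dt + 0.5 * accel * dt
    height = heights[-1]
    elapsed = (len(heights) - 1) * dt
    for _ in range(max_steps):  # guard against a model that never lands
        if height <= 0:
            break
        height += velocity * dt + 0.5 * accel * dt ** 2
        velocity += accel * dt
        elapsed += dt
    return elapsed


# Synthetic observations: thrown upward at 5 m/s from 2 m (assumed values).
DT = 0.05
observed = [2.0 + 5.0 * (i * DT) - 0.5 * 9.81 * (i * DT) ** 2 for i in range(6)]

learned_accel = fit_acceleration(observed, DT)
landing_time = predict_landing_time(observed, DT)
print(f"learned acceleration: {learned_accel:.2f} m/s^2")
print(f"predicted landing after about {landing_time:.2f} s")
```

The model never sees the constant 9.81 directly; it recovers it from six observations and then uses it to predict a landing it has not observed yet.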
World models vs. language models
| Aspect | Language Models | World Models |
|---|---|---|
| Input | Text tokens | Video, images, sensor data, actions |
| Output | Next token prediction | Future state prediction (video frames, 3D scenes) |
| Understanding | Linguistic patterns, reasoning | Physical dynamics, spatial relationships |
| Training data | Text corpora | Video datasets, simulation data |
| Applications | Chat, writing, code | Robotics, autonomous vehicles, video generation |
Language models predict what word comes next. World models predict what happens next in physical space.
How world models work
Modern world models typically follow a three-stage architecture:
1. Perception: encoding observations
Raw sensory data—video frames, lidar scans, camera feeds—gets compressed into compact latent representations. This step extracts meaningful features while discarding irrelevant pixel-level details.
Many systems use vision transformers or specialized encoders trained on massive video datasets. The goal is to create representations that capture the semantic content of scenes: objects, their positions, relationships, and properties.
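To make the compression step tangible, here is a minimal sketch that shrinks a frame into a short latent vector by averaging small pixel blocks. This fixed pooling is only a stand-in: the encoders described above are learned networks, and `encode_frame` and the toy 4x4 frame are assumptions made up for illustration.

```python
def encode_frame(frame, patch=2):
    """Compress a 2-D grayscale frame into a flat latent vector by
    averaging non-overlapping patch x patch blocks."""
    rows, cols = len(frame), len(frame[0])
    latent = []
    for r in range(0, rows, patch):
        for c in range(0, cols, patch):
            block = [
                frame[i][j]
                for i in range(r, min(r + patch, rows))
                for j in range(c, min(c + patch, cols))
            ]
            latent.append(sum(block) / len(block))
    return latent


# A toy 4x4 frame with four uniform regions.
frame = [
    [0, 0, 8, 8],
    [0, 0, 8, 8],
    [4, 4, 2, 2],
    [4, 4, 2, 2],
]
latent = encode_frame(frame)
print(latent)  # 16 pixel values reduced to 4 latent values
```

Sixteen pixels become four numbers, and the coarse layout of the scene survives while pixel-level detail is discarded. A learned encoder makes the same trade, but chooses what to keep based on training rather than a fixed rule.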
2. Prediction: simulating the future
The prediction component takes current latent states and—crucially—potential actions, then forecasts future states. This is where the "world modeling" happens.
Two dominant architectures have emerged:
Autoregressive models generate future frames one at a time, each conditioned on all previous frames. This is how systems like Starchild-1 operate—predicting the next audio-video state from the history of observations and inputs.
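The autoregressive loop itself is simple, and a short sketch shows its shape. Everything here is hypothetical: `toy_step` is a hand-written stand-in for a trained network, states are single numbers rather than frames, and the `context` window is an assumption (this sketch caps the history it feeds back in, rather than conditioning on every previous state).

```python
from typing import Callable, List, Sequence


def rollout(step: Callable[[Sequence[float], float], float],
            history: Sequence[float],
            actions: Sequence[float],
            context: int = 4) -> List[float]:
    """Generate one future state per action, feeding each prediction
    back in as input for the next step."""
    states = list(history)
    for action in actions:
        window = states[-context:]  # bounded context, an assumption of this sketch
        states.append(step(window, action))
    return states[len(history):]


def toy_step(window: Sequence[float], action: float) -> float:
    """Hypothetical dynamics: keep the latest velocity, then add the action."""
    velocity = window[-1] - window[-2] if len(window) > 1 else 0.0
    return window[-1] + velocity + action


future = rollout(toy_step, history=[0.0, 1.0], actions=[0.0, 0.0, 1.0])
print(future)
```

The key property is that after the first step the model consumes its own outputs. That is what makes the generation interactive (each action can change the next state) and also what lets small errors accumulate over long rollouts.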
Joint Embedding Predictive Architecture (JEPA), championed by Yann LeCun at Meta, predicts embeddings rather than raw pixels. The system learns to predict the representation of the next state in latent space, avoiding the computational cost of generating high-resolution outputs.
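The difference between predicting pixels and predicting embeddings can be sketched in a few lines. This is not Meta's implementation: the encoder, the predictor, and the hand-picked weights below are toy assumptions, and real JEPA training involves components (such as masking and a separately maintained target encoder) that this sketch leaves out. It only shows where the loss is computed.

```python
def encode(frame):
    """Toy encoder (assumption): summarise a frame as [mean, max]."""
    flat = [p for row in frame for p in row]
    return [sum(flat) / len(flat), max(flat)]


def predict_embedding(embedding, weights):
    """Toy predictor (assumption): a per-dimension scale and shift."""
    return [w * e + b for e, (w, b) in zip(embedding, weights)]


def latent_loss(predicted, target):
    """Mean absolute error between predicted and target embeddings."""
    return sum(abs(p - t) for p, t in zip(predicted, target)) / len(target)


frame_t = [[0, 2], [4, 6]]
frame_next = [[1, 3], [5, 7]]
weights = [(1.0, 1.0), (1.0, 1.0)]  # hand-picked for the example, not trained

predicted = predict_embedding(encode(frame_t), weights)
loss = latent_loss(predicted, encode(frame_next))
pixels_per_frame = sum(len(row) for row in frame_next)

print(f"loss in latent space: {loss}")
print(f"values compared: {len(predicted)} (latent) vs {pixels_per_frame} (pixels)")
```

The predictor is scored on two latent values instead of four pixels. At video resolution the same idea means comparing compact embeddings instead of millions of pixel values per frame.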
3. Generation: rendering outputs
For applications requiring visual outputs, the predicted latent states get decoded back into pixels, 3D geometry, or audio waveforms. This often uses diffusion models, GANs, or specialized decoders.
Some systems—like Tencent's HY-World 2.0—output explicit 3D representations (meshes, Gaussian splats) rather than video frames, enabling persistent, editable environments.
Starchild-1: the first multimodal world model
Starchild-1 from Odyssey ML is billed as the world's first multimodal world model: it generates synchronized audio and video in real time while responding to continuous user input.
What makes Starchild-1 different
Traditional world models generate only visual outputs. Starchild-1 treats sound as an integral component of world simulation. When you see a door close in the generated video, you hear it close simultaneously. When rain falls, you hear rain sounds—all generated in sync, not added in post-processing.
Key capabilities
Real-time multimodal generation: The system autoregressively produces audio and video in sync, which distinguishes it from offline video generation systems that process entire clips before producing output.
Interactive responsiveness: Users can stream text, speech, and action inputs that dynamically alter both visuals and sounds being generated. The world responds to you as you interact with it, rather than following a predetermined trajectory.
Long-horizon stability: The model maintains coherent audio-video generation over extended interactions—a significant technical achievement given how errors typically compound in autoregressive systems.
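A toy calculation shows why compounding errors make long horizons hard. It assumes a one-number "world" that should stay constant and a model that is off by 1% per step; both values are invented for illustration and say nothing about Starchild-1's actual error rates.

```python
def rollout_error(true_gain, model_gain, steps, x0=1.0):
    """Absolute gap between the true and the model trajectory after each
    step when the model keeps feeding on its own predictions."""
    true_x, model_x = x0, x0
    errors = []
    for _ in range(steps):
        true_x *= true_gain
        model_x *= model_gain
        errors.append(abs(model_x - true_x))
    return errors


errors = rollout_error(true_gain=1.0, model_gain=1.01, steps=100)
for step in (1, 10, 50, 100):
    print(f"after {step:3d} steps: error = {errors[step - 1]:.3f}")
```

A 1% error per step grows to roughly 170% of the true value after 100 steps, because each prediction inherits and amplifies the error of the one before it. At 24 frames per second, 100 steps is only about four seconds of video.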
Technical architecture
Starchild-1 functions as a causal multimodal world model that predicts the next audio-video state based on past observations and streaming inputs. Key innovations include:
- Causal distillation pipeline: Converts a bidirectional foundation model into a real-time autoregressive system while preserving synchronized generation
- Asynchronous KV-cache architecture: Accommodates the different sampling rates of audio and video (audio at 44.1 kHz, video at 24-30 fps)
- Rollout adaptation strategy: Addresses the distinct characteristics of each modality during extended generation
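The rate mismatch behind the asynchronous cache is easy to quantify from the figures quoted in the list above. The short calculation below only works out how many audio samples fall within one video frame; it does not describe how Odyssey's cache is actually organized.

```python
from fractions import Fraction

AUDIO_HZ = 44_100  # audio sample rate quoted in the text


def samples_per_frame(fps):
    """Exact number of audio samples that span one video frame."""
    return Fraction(AUDIO_HZ, fps)


for fps in (24, 30):
    ratio = samples_per_frame(fps)
    print(f"{fps} fps: {float(ratio)} audio samples per video frame")
```

At 30 fps each frame spans 1470 samples, but at 24 fps it spans 1837.5, which is not a whole number. The two streams therefore cannot simply advance in lockstep one "step" at a time, which is one plausible reason to track them asynchronously.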
The research philosophy
Odyssey's approach emphasizes that greater intelligence emerges from exploring and learning directly from the world itself through multimodal interaction—not just from text. Sound provides crucial information about physics, materials, and causality that vision alone misses.
Google DeepMind Genie 2: interactive 3D worlds from images
Genie 2 from Google DeepMind generates interactive, playable 3D environments from a single image and text description.
Capabilities
Feed Genie 2 an image—concept art, a photograph, a drawing—with a text prompt like "A cute humanoid robot in the woods," and it creates a real-time 3D world you can explore. The system handles:
- Object physics and collisions: Objects behave realistically when interacted with
- NPC behavior: Characters in the scene act autonomously
- Perspective changes: The world remains consistent as you move through it
- Environmental effects: Lighting, water, particles all respond appropriately
Generated worlds remain consistent for approximately 10-60 seconds of exploration—impressive for a research system, though not yet suitable for extended gameplay.
Out-of-distribution generalization
Genie 2's standout feature is its ability to handle inputs it wasn't explicitly trained on. Concept art, children's drawings, and unusual visual styles can all become playable environments. This generalization capability makes it valuable for rapid prototyping in game development and creative applications.
Current status
Since its December 2024 announcement, Genie 2 has been accessible only to select testers. It remains a research and demo system from Google DeepMind, not a freely available product—but it's increasingly shown at conferences as a building block for games, simulation, and agent training.
NVIDIA Cosmos: world foundation models for physical AI
NVIDIA Cosmos is a platform of world foundation models (WFMs) purpose-built for autonomous vehicles and robotics.
The Cosmos family
The platform includes three main components:
- Cosmos-Predict: Forecasts future scenarios given current observations and potential actions
- Cosmos-Transfer: Transforms data between modalities and domains
- Cosmos-Reason: Provides physical AI reasoning capabilities
Recent releases (through early 2026) include Cosmos Transfer 2.5, Cosmos Predict 2.5, and Cosmos Reason 2, enhancing synthetic data generation, scenario prediction, and physical AI reasoning.
Industry adoption
Cosmos has gained traction with major robotics and AV companies:
- Agility Robotics and Figure AI use it for humanoid robot training
- Uber and Waabi apply it to autonomous vehicle development
- Virtual Incision explores deployment in surgical robots
- Skild AI and Foretellix leverage it for richer training data generation
The value proposition
NVIDIA positions Cosmos as solving the data bottleneck in physical AI. Training robots and autonomous vehicles requires massive datasets of diverse scenarios—including rare, safety-critical edge cases. Cosmos generates synthetic training data at scale, creating scenarios that would be dangerous, expensive, or impossible to capture in the real world.
Meta V-JEPA 2: self-supervised video understanding
V-JEPA 2 from Meta represents Yann LeCun's vision for world models that learn through self-supervised prediction in latent space.
Architecture
V-JEPA 2 has two main components:
Encoder: Takes raw video and outputs embeddings that capture useful semantic information about the state of the observed world.
Action-conditioned predictor: A 300M-parameter transformer that autoregressively predicts the representation of the next video frame conditioned on an action and previous states.
Training approach
V-JEPA 2 is pre-trained on over 1 million hours of internet video without external supervision. The system learns by predicting masked portions of video in embedding space—understanding motion, object permanence, and physical dynamics through pure observation.
Capabilities
Meta reports state-of-the-art performance for V-JEPA 2 on visual understanding and prediction benchmarks. More importantly, the model enables zero-shot robot control in new environments: its understanding of physical dynamics transfers to manipulation tasks it was never explicitly trained on.
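One common way to turn an action-conditioned predictor into a controller is to imagine the outcome of candidate action sequences and pick the one that ends closest to a goal. The sketch below illustrates that general idea only. The 2-D state, the `predict` function, and the exhaustive search are assumptions for the example, not a description of V-JEPA 2's planner; practical systems search over learned embeddings and use sampling-based optimizers rather than enumeration.

```python
from itertools import product


def predict(state, action):
    """Hypothetical latent dynamics: the action shifts a 2-D state."""
    return (state[0] + action[0], state[1] + action[1])


def distance(a, b):
    """L1 distance between two states."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def plan(start, goal, actions, horizon):
    """Score every action sequence by its imagined outcome and return the
    one whose predicted end state is closest to the goal."""
    best_seq, best_cost = None, float("inf")
    for seq in product(actions, repeat=horizon):
        state = start
        for action in seq:
            state = predict(state, action)
        cost = distance(state, goal)
        if cost < best_cost:
            best_seq, best_cost = seq, cost
    return best_seq, best_cost


MOVES = [(0, 0), (1, 0), (-1, 0), (0, 1), (0, -1)]
best_seq, best_cost = plan(start=(0, 0), goal=(2, 1), actions=MOVES, horizon=3)
print(best_seq, best_cost)
```

No action is executed while planning: the world model evaluates all 125 candidate sequences internally, and only the winning sequence would be sent to the robot.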
As LeCun puts it: "V-JEPA is a step toward a more grounded understanding of the world so machines can achieve more generalized reasoning and planning."
World Labs Marble: spatial intelligence
World Labs, founded by AI pioneer Fei-Fei Li, focuses on spatial intelligence—the ability to perceive, generate, and interact with 3D environments.
The Marble model
World Labs' first commercial product, Marble (launched November 2025), creates full 3D environments from text prompts, images, or video snippets. Unlike video-based world models that output flat pixel sequences, Marble generates actual 3D geometry—interior spaces, exterior landscapes, and navigable environments.
The spatial intelligence thesis
Li argues that spatial intelligence is AI's next frontier. Her manifesto states: "Spatial intelligence will transform how we create and interact with real and virtual worlds—revolutionizing storytelling, creativity, robotics, scientific discovery, and beyond."
The distinction is important: world models that output video give you something to watch. Spatial intelligence gives you somewhere to go.
Funding and momentum
World Labs raised $1 billion in February 2026 from AMD, Autodesk, Emerson Collective, Fidelity, NVIDIA, and Sea—signaling serious commercial interest in spatial AI applications.
Tencent HY-World 2.0: 3D assets over video
HY-World 2.0 from Tencent Hunyuan takes a different approach: instead of generating video, it outputs persistent 3D assets—meshes, Gaussian splats, and point clouds that can be edited and imported into engines like Unity, Unreal, and Blender.
The core argument
Tencent's team contrasts their approach with video-only world models: videos are hard to edit, duration-limited, and can flicker across views. Explicit 3D representations are persistent, view-consistent, and can be rendered in real-time on consumer GPUs after a one-time generation cost.
They frame it as "building a playable world" versus "watching a movie that ends."
WorldMirror 2.0
The currently available component, WorldMirror 2.0, handles world reconstruction—turning multi-view images or casual video into 3D scenes in a single forward pass. It estimates depth, surface normals, camera parameters, point clouds, and Gaussian splatting attributes simultaneously.
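Depth, camera parameters, and point clouds are closely linked: given a depth map and pinhole camera intrinsics, each pixel can be back-projected to a 3D point. The sketch below shows that standard geometric relationship. It is not Tencent's code, and the tiny depth map and the intrinsics (`fx`, `fy`, `cx`, `cy`) are made-up values.

```python
def depth_to_points(depth, fx, fy, cx, cy):
    """Back-project a depth map into camera-space 3-D points with a
    pinhole model. depth[v][u] is the distance along the optical axis."""
    points = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            if z <= 0:  # skip invalid or missing depth
                continue
            x = (u - cx) * z / fx
            y = (v - cy) * z / fy
            points.append((x, y, z))
    return points


# A 2x2 depth map in metres; 0.0 marks a pixel with no depth estimate.
depth = [
    [2.0, 2.0],
    [0.0, 4.0],
]
points = depth_to_points(depth, fx=1.0, fy=1.0, cx=0.5, cy=0.5)
for point in points:
    print(point)
```

Estimating these quantities jointly in a single forward pass is the notable part of the description above; once they are available, converting between them is plain geometry.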
The full world generation pipeline (text/single image → navigable 3D) is on their roadmap but not yet released.
Wayve GAIA: world models for autonomous driving
GAIA-1 and GAIA-2 from UK-based Wayve are world models specifically designed for autonomous vehicle development.
GAIA-1
Trained on 4,700 hours of driving video collected in London (2019-2023), GAIA-1 generates lifelike driving scenarios with precise control over both ego-vehicle actions and environmental elements. It takes video, text, and action inputs as token sequences.
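Treating several modalities as one token sequence can be sketched as follows. The paragraph above says only that GAIA-1 takes video, text, and action inputs as token sequences; the text-image-action ordering within each timestep, the modality tags, and the integer token ids below are assumptions chosen for illustration.

```python
def interleave(text_tokens, frame_tokens, action_tokens):
    """Flatten per-timestep tokens into one sequence, tagging each token
    with its modality and timestep so one sequence model can consume them."""
    sequence = []
    for t, (text, frame, action) in enumerate(
            zip(text_tokens, frame_tokens, action_tokens)):
        sequence += [("text", t, tok) for tok in text]
        sequence += [("image", t, tok) for tok in frame]
        sequence += [("action", t, tok) for tok in action]
    return sequence


# Two timesteps with hypothetical token ids.
text_tokens = [[7, 8], [7, 8]]         # e.g. a tokenized prompt
frame_tokens = [[1, 2, 3], [4, 5, 6]]  # e.g. discretized image patches
action_tokens = [[0], [1]]             # e.g. discretized speed or steering

sequence = interleave(text_tokens, frame_tokens, action_tokens)
for item in sequence:
    print(item)
```

Once everything is a token in a single stream, predicting the future becomes the same next-token problem that language models solve, with image tokens as the main prediction target.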
GAIA-2
The successor focuses on controllable multi-camera video generation for creating rare and safety-critical scenarios at scale. AV development requires exposure to dangerous situations—near-misses, unusual pedestrian behavior, adverse weather conditions—that are difficult to capture safely in real-world testing. GAIA-2 generates these scenarios synthetically.
Runway GWM-1: real-time conversational video
Runway built their General World Model (GWM-1) to power Runway Characters—real-time conversational video agents from single reference images.
Performance
GWM-1 achieves approximately 24 fps HD video generation with about 37 ms effective model time per frame. Server-side turn latency (from end of speech to first response frame) runs around 1.75 seconds in measured sessions.
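These figures fit together, as a quick back-of-the-envelope check shows. The calculation uses only the numbers quoted above; the variable names are illustrative.

```python
def frame_budget_ms(fps):
    """Time available per frame at a given frame rate, in milliseconds."""
    return 1000.0 / fps


FPS = 24
MODEL_MS = 37.0        # effective model time per frame, from the text
TURN_LATENCY_S = 1.75  # server-side turn latency, from the text

budget = frame_budget_ms(FPS)
headroom = budget - MODEL_MS
frames_during_turn = TURN_LATENCY_S * FPS

print(f"per-frame budget at {FPS} fps: {budget:.2f} ms")
print(f"headroom after model time: {headroom:.2f} ms")
print(f"turn latency expressed in frames: {frames_during_turn:.0f}")
```

At 24 fps each frame has about 41.7 ms available, so 37 ms of model time leaves under 5 ms for everything else on the critical path. The 1.75 second turn latency corresponds to 42 frames of playback time.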
Technical approach
Key optimizations include:
- Distribution Matching Distillation to reduce denoising steps
- Overlapped diffusion and decode pipelines
- KV-cache management for autoregressive video
- CUDA Graphs to eliminate kernel launch overhead
The result is interactive video agents that can respond to user input in real-time—useful for support, education, and entertainment applications.
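Why overlapping the diffusion and decode stages helps can be seen with an idealized pipeline model. The stage times below are hypothetical, not Runway's measurements, and the model assumes the two stages can run concurrently without competing for the same resources.

```python
def sequential_ms(stage_times, frames):
    """Total time when each frame finishes every stage before the next starts."""
    return sum(stage_times) * frames


def overlapped_ms(stage_times, frames):
    """Total time when stages work on consecutive frames concurrently:
    fill the pipeline once, then emit one frame per slowest-stage interval."""
    return sum(stage_times) + max(stage_times) * (frames - 1)


STAGES = [30.0, 10.0]  # hypothetical diffusion and decode times in ms
FRAMES = 24

print(f"sequential: {sequential_ms(STAGES, FRAMES):.0f} ms for {FRAMES} frames")
print(f"overlapped: {overlapped_ms(STAGES, FRAMES):.0f} ms for {FRAMES} frames")
```

With these made-up numbers, running the stages back to back takes 960 ms for 24 frames, which is just under one second of video. Overlapping them takes 730 ms, because throughput is then limited by the slowest stage rather than by the sum of both.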
Why world models matter
Beyond language: grounding AI in physics
Large language models show remarkable reasoning capabilities but operate largely in linguistic space. They can describe how a ball falls, but producing that description does not require an internal simulation of gravity. World models provide the physical grounding that lets AI systems capture not just what things are called, but how they actually behave.
The robotics training problem
Training robots in the physical world is slow, expensive, and potentially dangerous. Each failed grasp costs time and risks equipment damage. World models enable simulation-to-real transfer: train in simulated environments generated by world models, then deploy in the real world.
Synthetic data generation
As AI systems grow more capable, they require increasingly massive datasets, and real-world data collection has scaling limits. World models can generate large volumes of synthetic training data, including rare scenarios, edge cases, and counterfactual situations that would be difficult or impossible to capture otherwise.
Video generation and entertainment
Consumer applications are emerging in video generation, gaming, and interactive media. The same world models that train robots can power AI video tools, generate game environments, and create interactive entertainment experiences.
The road ahead
As of 2026, world models are at an early stage but improving rapidly. Current limitations include:
- Temporal consistency: Most systems maintain coherence for seconds to minutes, not hours
- Physical accuracy: Generated physics is often plausible but not precise
- Computational cost: Real-time generation requires significant GPU resources
- Multimodal integration: Starchild-1 shows audio-video is possible; full sensory integration remains frontier research
The trajectory is clear: from text understanding to world understanding. As these systems improve, the gap between AI simulation and physical reality will continue to narrow.
Sources
- Odyssey ML: Introducing Starchild-1
- Google DeepMind: Genie 2 foundation world model
- NVIDIA Cosmos World Foundation Models
- Meta: Introducing V-JEPA 2
- World Labs
- Tencent HY-World 2.0 GitHub
- Wayve GAIA-1 Paper
- Runway: Building Runway Characters
- NVIDIA Glossary: World Models
- Nature: World models are AI's latest sensation
- MIT Technology Review: World models in AI
World model capabilities, APIs, and pricing evolve rapidly. Verify current specifications directly with vendors before integration decisions.