What Are World Models? The AI Systems That Simulate Reality (Starchild-1 and Beyond)

World models are AI systems that learn to simulate and predict how the physical world works. Explore how they function, from Odyssey's Starchild-1 to Google Genie 2, NVIDIA Cosmos, and Meta V-JEPA 2.

Yash Thakker · 12 min read
Tags: World models, Starchild-1, Video generation, Physical AI, Simulation

If you've followed AI developments in 2025-2026, you've likely encountered the term world models: systems that don't just process language but learn to simulate how the physical world behaves. From Odyssey's Starchild-1 generating synchronized audio and video in real time to Google's Genie 2 creating playable 3D environments from single images, world models represent a fundamental shift in what AI systems can do.

This guide explains what world models are, how they work, why they matter, and profiles the leading systems defining this space.


What is a world model?

A world model is a neural network that learns the dynamics of the physical world (physics, spatial properties, object permanence, and causality) and can simulate how environments evolve over time.

The core idea: rather than simply memorizing patterns, world models learn internal representations of how the world works. Given current observations (video frames, sensor data, images) and potential actions, they predict what happens next.

This mirrors how humans navigate the world. When you see a ball thrown, you don't recalculate physics from first principles; your brain has an internal model that quickly predicts where the ball will land. World models aim to give machines a similar capability.
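As a toy illustration of "learn the dynamics, then predict what happens next", the sketch below fits a one-step predictor to simulated ball throws with least squares. The state layout, time step, and the `simulate_ball` helper are assumptions made for this example; real world models learn from high-dimensional video rather than four numbers.

```python
import numpy as np

DT = 0.05  # assumed time step in seconds
G = 9.81   # gravitational acceleration, m/s^2

def simulate_ball(state, steps):
    """Ground-truth physics for the toy: state = [x, y, vx, vy], Euler steps."""
    states = [np.asarray(state, dtype=float)]
    for _ in range(steps):
        x, y, vx, vy = states[-1]
        states.append(np.array([x + vx * DT, y + vy * DT, vx, vy - G * DT]))
    return np.stack(states)

# Collect (state, next_state) pairs from a few random throws.
rng = np.random.default_rng(0)
inputs, targets = [], []
for _ in range(20):
    start = np.concatenate([rng.uniform(0, 1, 2), rng.uniform(1, 5, 2)])
    traj = simulate_ball(start, steps=30)
    inputs.append(traj[:-1])
    targets.append(traj[1:])
X = np.concatenate(inputs)
Y = np.concatenate(targets)

# Fit next_state ≈ [state, 1] @ W; the constant column absorbs gravity.
X_aug = np.hstack([X, np.ones((len(X), 1))])
W, *_ = np.linalg.lstsq(X_aug, Y, rcond=None)

def predict_next(state):
    """Learned one-step predictor (the toy 'world model')."""
    return np.append(state, 1.0) @ W

new_throw = np.array([0.0, 1.0, 3.0, 4.0])
predicted = predict_next(new_throw)
actual = simulate_ball(new_throw, steps=1)[1]
```

Because the toy physics is exactly affine, the fitted predictor matches the simulator on a throw it never saw. Real scenes are nowhere near this simple, which is why the systems profiled here use large neural networks instead of a single matrix.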

World models vs. language models

| Aspect | Language Models | World Models |
| --- | --- | --- |
| Input | Text tokens | Video, images, sensor data, actions |
| Output | Next token prediction | Future state prediction (video frames, 3D scenes) |
| Understanding | Linguistic patterns, reasoning | Physical dynamics, spatial relationships |
| Training data | Text corpora | Video datasets, simulation data |
| Applications | Chat, writing, code | Robotics, autonomous vehicles, video generation |

Language models predict what word comes next. World models predict what happens next in physical space.


How world models work

Modern world models typically follow a three-stage architecture:

1. Perception: encoding observations

Raw sensory data—video frames, lidar scans, camera feeds—gets compressed into compact latent representations. This step extracts meaningful features while discarding irrelevant pixel-level details.

Many systems use vision transformers or specialized encoders trained on massive video datasets. The goal is to create representations that capture the semantic content of scenes: objects, their positions, relationships, and properties.

2. Prediction: simulating the future

The prediction component takes current latent states and—crucially—potential actions, then forecasts future states. This is where the "world modeling" happens.

Two dominant architectures have emerged:

Autoregressive models generate future frames one at a time, each conditioned on all previous frames. This is how systems like Starchild-1 operate—predicting the next audio-video state from the history of observations and inputs.

Joint Embedding Predictive Architecture (JEPA), championed by Yann LeCun at Meta, predicts embeddings rather than raw pixels. The system learns to predict the representation of the next state in latent space, avoiding the computational cost of generating high-resolution outputs.
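To make the "predict embeddings, not pixels" idea concrete, here is a minimal sketch of a latent prediction objective. The random linear maps stand in for trained networks, and the frame size, latent size, and choice of L1 distance are all assumptions for illustration, not details of any Meta model.

```python
import numpy as np

rng = np.random.default_rng(0)
FRAME_SHAPE = (64, 64, 3)  # assumed toy frame size
LATENT_DIM = 32            # assumed embedding size

frame_dim = int(np.prod(FRAME_SHAPE))
# Placeholder "networks": random linear maps standing in for trained models.
W_enc = rng.normal(scale=frame_dim ** -0.5, size=(frame_dim, LATENT_DIM))
W_pred = rng.normal(scale=LATENT_DIM ** -0.5, size=(LATENT_DIM, LATENT_DIM))

def encode(frame):
    """Map a frame to a compact embedding."""
    return frame.reshape(-1) @ W_enc

def latent_prediction_loss(frame_t, frame_next):
    """Compare the predicted embedding of the next frame with its actual embedding."""
    z_pred = encode(frame_t) @ W_pred
    z_target = encode(frame_next)  # in training, this branch would receive no gradient
    return float(np.mean(np.abs(z_pred - z_target)))

frame_t = rng.random(FRAME_SHAPE)
frame_next = rng.random(FRAME_SHAPE)
loss = latent_prediction_loss(frame_t, frame_next)

# How many numbers each approach must predict per step in this toy setup.
values_per_prediction = {"pixels": frame_dim, "latent": LATENT_DIM}
```

In this toy setup a pixel-space predictor would have to output 12,288 values per frame, while the latent predictor outputs 32. That gap is the computational argument for predicting in embedding space.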

3. Generation: rendering outputs

For applications requiring visual outputs, the predicted latent states get decoded back into pixels, 3D geometry, or audio waveforms. This often uses diffusion models, GANs, or specialized decoders.

Some systems—like Tencent's HY-World 2.0—output explicit 3D representations (meshes, Gaussian splats) rather than video frames, enabling persistent, editable environments.
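The three stages can be sketched as a single loop: encode once, predict repeatedly in latent space, decode each predicted state. The class below is a structural sketch only. `ToyWorldModel`, its dimensions, and the linear placeholders are hypothetical; a real system would use trained encoders, predictors, and decoders.

```python
import numpy as np

class ToyWorldModel:
    """Encode -> predict -> decode, with linear placeholders for each stage."""

    def __init__(self, obs_dim, action_dim, latent_dim, seed=0):
        rng = np.random.default_rng(seed)
        self.enc = rng.normal(size=(obs_dim, latent_dim)) / np.sqrt(obs_dim)
        in_dim = latent_dim + action_dim
        self.dyn = rng.normal(size=(in_dim, latent_dim)) / np.sqrt(in_dim)
        self.dec = np.linalg.pinv(self.enc)  # decoder as pseudo-inverse of the encoder

    def encode(self, obs):
        """Stage 1: compress an observation into a latent state."""
        return obs @ self.enc

    def predict(self, latent, action):
        """Stage 2: forecast the next latent state given an action."""
        return np.tanh(np.concatenate([latent, action]) @ self.dyn)

    def decode(self, latent):
        """Stage 3: render a latent state back into observation space."""
        return latent @ self.dec

    def rollout(self, obs, actions):
        latent = self.encode(obs)
        frames = []
        for action in actions:
            latent = self.predict(latent, action)
            frames.append(self.decode(latent))
        return np.stack(frames)

model = ToyWorldModel(obs_dim=48, action_dim=2, latent_dim=8)
first_obs = np.zeros(48)
actions = [np.array([1.0, 0.0])] * 5
predicted_frames = model.rollout(first_obs, actions)  # one decoded frame per action
```

Note that the rollout never re-encodes its own decoded frames: prediction stays in latent space and decoding is only needed when something has to be shown.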


Starchild-1: the first multimodal world model

Starchild-1 from Odyssey ML represents a significant milestone: the world's first multimodal world model that generates synchronized audio and video in real time while responding to continuous user input.

What makes Starchild-1 different

Traditional world models generate only visual outputs. Starchild-1 treats sound as an integral component of world simulation. When you see a door close in the generated video, you hear it close simultaneously. When rain falls, you hear rain sounds—all generated in sync, not added in post-processing.

Key capabilities

Real-time multimodal generation: The system autoregressively produces synchronized audio and video simultaneously, distinguishing it from offline video generation systems that process entire clips before output.

Interactive responsiveness: Users can stream text, speech, and action inputs that dynamically alter both visuals and sounds being generated. The world responds to you as you interact with it, rather than following a predetermined trajectory.

Long-horizon stability: The model maintains coherent audio-video generation over extended interactions—a significant technical achievement given how errors typically compound in autoregressive systems.
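Why compounding errors make long-horizon stability hard can be seen in a toy rollout. The sketch below is not Starchild-1's model; it assumes a simple rotating 2D state and a "learned" model whose step is off by 1%, then feeds the model its own predictions.

```python
import numpy as np

def rotation(theta):
    """2D rotation matrix for an angle in radians."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

TRUE_STEP = 0.10    # radians per step in the "real" system (assumed)
MODEL_STEP = 0.101  # the model's step is off by 1% (assumed)

true_state = np.array([1.0, 0.0])
model_state = true_state.copy()
errors = []
for _ in range(200):
    true_state = rotation(TRUE_STEP) @ true_state
    # Autoregressive: the model consumes its own previous prediction.
    model_state = rotation(MODEL_STEP) @ model_state
    errors.append(np.linalg.norm(model_state - true_state))

one_step_error = errors[0]    # roughly 0.001
error_after_200 = errors[-1]  # roughly 0.2
```

A per-step error of about 0.001 grows to about 0.2 after 200 steps, because every prediction starts from a slightly wrong state. Keeping audio and video coherent over long interactions means controlling this kind of drift.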

Technical architecture

Starchild-1 functions as a causal multimodal world model that predicts the next audio-video state based on past observations and streaming inputs. Key innovations include:

  • Causal distillation pipeline: Converts a bidirectional foundation model into a real-time autoregressive system while preserving synchronized generation
  • Asynchronous KV-cache architecture: Accommodates audio and video's different temporal frequencies (audio at 44.1 kHz, video at 24-30 fps)
  • Rollout adaptation strategy: Addresses the distinct characteristics of each modality during extended generation
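The mismatch between audio and video rates is easy to quantify. The arithmetic below uses the rates quoted above; the helper names are hypothetical, and how Starchild-1 actually aligns the two streams is not something this sketch claims to show.

```python
from fractions import Fraction

def audio_samples_per_frame(sample_rate_hz, fps):
    """Exact number of audio samples that elapse during one video frame."""
    return Fraction(sample_rate_hz, fps)

def frame_audio_spans(sample_rate_hz, fps, num_frames):
    """Integer [start, end) audio sample ranges for each video frame."""
    per_frame = Fraction(sample_rate_hz, fps)
    return [(int(i * per_frame), int((i + 1) * per_frame)) for i in range(num_frames)]

ratio_24 = audio_samples_per_frame(44_100, 24)  # 3675/2, i.e. 1837.5 samples
ratio_30 = audio_samples_per_frame(44_100, 30)  # 1470 samples
spans = frame_audio_spans(44_100, 24, 4)
# spans == [(0, 1837), (1837, 3675), (3675, 5512), (5512, 7350)]
```

At 24 fps each frame covers 1837.5 audio samples, so frames alternate between 1837 and 1838 samples. A system generating both modalities step by step has to handle that uneven alignment, along with the fact that audio produces well over a thousand values for every video frame.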

The research philosophy

Odyssey's approach emphasizes that greater intelligence emerges from exploring and learning directly from the world itself through multimodal interaction—not just from text. Sound provides crucial information about physics, materials, and causality that vision alone misses.


Google DeepMind Genie 2: interactive 3D worlds from images

Genie 2 from Google DeepMind generates interactive, playable 3D environments from a single image and text description.

Capabilities

Feed Genie 2 an image—concept art, a photograph, a drawing—with a text prompt like "A cute humanoid robot in the woods," and it creates a real-time 3D world you can explore. The system handles:

  • Object physics and collisions: Objects behave realistically when interacted with
  • NPC behavior: Characters in the scene act autonomously
  • Perspective changes: The world remains consistent as you move through it
  • Environmental effects: Lighting, water, particles all respond appropriately

Generated worlds remain consistent for approximately 10-60 seconds of exploration—impressive for a research system, though not yet suitable for extended gameplay.

Out-of-distribution generalization

Genie 2's standout feature is its ability to handle inputs it wasn't explicitly trained on. Concept art, children's drawings, and unusual visual styles can all become playable environments. This generalization capability makes it valuable for rapid prototyping in game development and creative applications.

Current status

Since its December 2024 announcement, Genie 2 has been accessible only to select testers. It remains a research and demo system from Google DeepMind, not a freely available product—but it's increasingly shown at conferences as a building block for games, simulation, and agent training.


NVIDIA Cosmos: world foundation models for physical AI

NVIDIA Cosmos is a platform of world foundation models (WFMs) purpose-built for autonomous vehicles and robotics.

The Cosmos family

The platform includes three main components:

  • Cosmos-Predict: Forecasts future scenarios given current observations and potential actions
  • Cosmos-Transfer: Transforms data between modalities and domains
  • Cosmos-Reason: Provides physical AI reasoning capabilities

Recent releases (through early 2026) include Cosmos Transfer 2.5, Cosmos Predict 2.5, and Cosmos Reason 2, enhancing synthetic data generation, scenario prediction, and physical AI reasoning.

Industry adoption

Cosmos has gained traction with major robotics and AV companies:

  • Agility Robotics and Figure AI use it for humanoid robot training
  • Uber and Waabi apply it to autonomous vehicle development
  • Virtual Incision explores deployment in surgical robots
  • Skild AI and Foretellix leverage it for richer training data generation

The value proposition

NVIDIA positions Cosmos as solving the data bottleneck in physical AI. Training robots and autonomous vehicles requires massive datasets of diverse scenarios—including rare, safety-critical edge cases. Cosmos generates synthetic training data at scale, creating scenarios that would be dangerous, expensive, or impossible to capture in the real world.


Meta V-JEPA 2: self-supervised video understanding

V-JEPA 2 from Meta represents Yann LeCun's vision for world models that learn through self-supervised prediction in latent space.

Architecture

V-JEPA 2 has two main components:

Encoder: Takes raw video and outputs embeddings that capture useful semantic information about the state of the observed world.

Action-conditioned predictor: A 300M-parameter transformer that autoregressively predicts the representation of the next video frame conditioned on an action and previous states.

Training approach

V-JEPA 2 is pre-trained on over 1 million hours of internet video without external supervision. The system learns by predicting masked portions of video in embedding space—understanding motion, object permanence, and physical dynamics through pure observation.

Capabilities

V-JEPA 2 achieves state-of-the-art performance on visual understanding and prediction benchmarks. More importantly, it enables zero-shot robot control in new environments—the model's understanding of physical dynamics transfers to manipulation tasks it was never explicitly trained on.

As LeCun puts it: "V-JEPA is a step toward a more grounded understanding of the world so machines can achieve more generalized reasoning and planning."
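One common way to turn a predictive model into a controller is sampling-based planning: imagine many action sequences, score the predicted outcomes, and execute the first action of the best one. The sketch below is a generic random-shooting planner on a toy point-mass. The `world_model` stand-in, horizon, and candidate count are assumptions, and it is not a description of Meta's planning procedure.

```python
import numpy as np

def world_model(state, action):
    """Stand-in for a learned predictor: a point that moves by `action`."""
    return state + action

def plan_first_action(state, goal, rng, horizon=5, candidates=256, max_step=0.2):
    """Random shooting: sample action sequences, imagine outcomes, keep the best."""
    actions = rng.uniform(-max_step, max_step, size=(candidates, horizon, state.size))
    final = np.repeat(state[None, :], candidates, axis=0)
    for t in range(horizon):
        final = world_model(final, actions[:, t])  # batched imagined rollout
    costs = np.linalg.norm(final - goal, axis=1)
    return actions[np.argmin(costs), 0]

rng = np.random.default_rng(0)
state = np.array([0.0, 0.0])
goal = np.array([1.0, 1.0])
start_distance = np.linalg.norm(state - goal)
for _ in range(30):
    # Replan at every step, executing only the first planned action.
    state = world_model(state, plan_first_action(state, goal, rng))
end_distance = np.linalg.norm(state - goal)
```

The planner never receives task-specific training. It only queries the model for "what would happen if", which is why a model with good physical predictions can be reused for tasks it was not trained on.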


World Labs Marble: spatial intelligence

World Labs, founded by AI pioneer Fei-Fei Li, focuses on spatial intelligence—the ability to perceive, generate, and interact with 3D environments.

The Marble model

World Labs' first commercial product, Marble (launched November 2025), creates full 3D environments from text prompts, images, or video snippets. Unlike video-based world models that output flat pixel sequences, Marble generates actual 3D geometry—interior spaces, exterior landscapes, and navigable environments.

The spatial intelligence thesis

Li argues that spatial intelligence is AI's next frontier. Her manifesto states: "Spatial intelligence will transform how we create and interact with real and virtual worlds—revolutionizing storytelling, creativity, robotics, scientific discovery, and beyond."

The distinction is important: world models that output video give you something to watch. Spatial intelligence gives you somewhere to go.

Funding and momentum

World Labs raised $1 billion in February 2026 from AMD, Autodesk, Emerson Collective, Fidelity, NVIDIA, and Sea—signaling serious commercial interest in spatial AI applications.


Tencent HY-World 2.0: 3D assets over video

HY-World 2.0 from Tencent Hunyuan takes a different approach: instead of generating video, it outputs persistent 3D assets—meshes, Gaussian splats, and point clouds that can be edited and imported into engines like Unity, Unreal, and Blender.

The core argument

Tencent's team contrasts their approach with video-only world models: videos are hard to edit, duration-limited, and can flicker across views. Explicit 3D representations are persistent, view-consistent, and can be rendered in real time on consumer GPUs after a one-time generation cost.

They frame it as "building a playable world" versus "watching a movie that ends."

WorldMirror 2.0

The currently available component, WorldMirror 2.0, handles world reconstruction—turning multi-view images or casual video into 3D scenes in a single forward pass. It estimates depth, surface normals, camera parameters, point clouds, and Gaussian splatting attributes simultaneously.
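Depth, camera parameters, and point clouds are closely linked: given a depth map and pinhole intrinsics, each pixel can be back-projected into 3D. The sketch below shows that standard relationship on a tiny synthetic depth map. The function name and intrinsic values are assumptions for illustration, not WorldMirror 2.0's API.

```python
import numpy as np

def depth_to_points(depth, fx, fy, cx, cy):
    """Back-project a depth map (H, W) into camera-frame 3D points (H*W, 3)."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel column and row indices
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return np.stack([x, y, depth], axis=-1).reshape(-1, 3)

# A flat wall 2 units in front of the camera, seen at 6x4 resolution.
depth = np.full((4, 6), 2.0)
points = depth_to_points(depth, fx=100.0, fy=100.0, cx=2.5, cy=1.5)
```

Every pixel becomes one 3D point, so the 6x4 depth map yields 24 points lying on the plane z = 2. A model that predicts depth and camera parameters together can produce point clouds this way without a separate reconstruction step.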

The full world generation pipeline (text/single image → navigable 3D) is on their roadmap but not yet released.


Wayve GAIA: world models for autonomous driving

GAIA-1 and GAIA-2 from UK-based Wayve are world models specifically designed for autonomous vehicle development.

GAIA-1

Trained on 4,700 hours of driving video collected in London (2019-2023), GAIA-1 generates lifelike driving scenarios with precise control over both ego-vehicle actions and environmental elements. It takes video, text, and action inputs as token sequences.

GAIA-2

The successor focuses on controllable multi-camera video generation for creating rare and safety-critical scenarios at scale. AV development requires exposure to dangerous situations—near-misses, unusual pedestrian behavior, adverse weather conditions—that are difficult to capture safely in real-world testing. GAIA-2 generates these scenarios synthetically.


Runway GWM-1: real-time conversational video

Runway built their General World Model (GWM-1) to power Runway Characters—real-time conversational video agents from single reference images.

Performance

GWM-1 achieves approximately 24 fps HD video generation with about 37 ms effective model time per frame. Server-side turn latency (from end of speech to first response frame) runs around 1.75 seconds in measured sessions.
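These figures can be put in context with some quick arithmetic. The calculation below uses only the numbers quoted above; the variable names are illustrative.

```python
def frame_budget_ms(fps):
    """Time available per frame, in milliseconds, at a given frame rate."""
    return 1000.0 / fps

budget = frame_budget_ms(24)            # about 41.67 ms per frame
model_time_ms = 37.0                    # effective model time quoted above
headroom_ms = budget - model_time_ms    # about 4.67 ms left for everything else
utilization = model_time_ms / budget    # about 0.89 of the frame budget
frames_during_turn_latency = 1.75 * 24  # 42 frames elapse during 1.75 s
```

At 24 fps the model consumes roughly 89% of each frame's time budget, leaving under 5 ms of slack per frame. That thin margin is why the optimizations listed next matter.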

Technical approach

Key optimizations include:

  • Distribution Matching Distillation to reduce denoising steps
  • Overlapped diffusion and decode pipelines
  • KV-cache management for autoregressive video
  • CUDA Graphs to eliminate kernel launch overhead
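Of these, KV caching is the easiest to illustrate in a few lines. The sketch below is a generic single-head example in NumPy, not Runway's implementation: it checks that attending step by step over cached keys and values gives the same result as recomputing causal attention over the whole sequence.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over the last axis."""
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def full_causal_attention(q, k, v):
    """Recompute attention over the whole sequence with a causal mask."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    mask = np.triu(np.ones_like(scores, dtype=bool), k=1)  # True above the diagonal
    scores = np.where(mask, -np.inf, scores)
    return softmax(scores) @ v

class KVCache:
    """Stores past keys and values so each new step reuses them."""

    def __init__(self):
        self.keys, self.values = [], []

    def step(self, q_t, k_t, v_t):
        self.keys.append(k_t)
        self.values.append(v_t)
        k = np.stack(self.keys)
        v = np.stack(self.values)
        weights = softmax(q_t @ k.T / np.sqrt(q_t.shape[-1]))
        return weights @ v

rng = np.random.default_rng(0)
q, k, v = (rng.normal(size=(6, 8)) for _ in range(3))
cache = KVCache()
incremental = np.stack([cache.step(q[t], k[t], v[t]) for t in range(6)])
reference = full_causal_attention(q, k, v)
```

The two outputs match, but the cached version does work proportional to the current sequence length at each step instead of reprocessing everything. For video the cache grows quickly, so deciding what to keep and what to evict becomes part of the design.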

The result is interactive video agents that can respond to user input in real time, which is useful for support, education, and entertainment applications.


Why world models matter

Beyond language: grounding AI in physics

Large language models achieve remarkable reasoning capabilities but operate primarily in linguistic space. They can describe how a ball falls, but they are not trained to simulate the fall itself. World models provide the physical grounding that enables AI systems to understand not just what things are called, but how they actually behave.

The robotics training problem

Training robots in the physical world is slow, expensive, and potentially dangerous. Each failed grasp costs time and risks equipment damage. World models enable simulation-to-real transfer: train in simulated environments generated by world models, then deploy in the real world.

Synthetic data generation

As AI systems grow more capable, they require increasingly massive datasets. Real-world data collection has scaling limits. World models can generate large volumes of synthetic training data, including rare scenarios, edge cases, and counterfactual situations that would be difficult or impossible to capture otherwise.

Video generation and entertainment

Consumer applications are emerging in video generation, gaming, and interactive media. The same world models that train robots can power AI video tools, generate game environments, and create interactive entertainment experiences.


The road ahead

World models in 2026 represent early but rapid progress. Current limitations include:

  • Temporal consistency: Most systems maintain coherence for seconds to minutes, not hours
  • Physical accuracy: Generated physics is often plausible but not precise
  • Computational cost: Real-time generation requires significant GPU resources
  • Multimodal integration: Starchild-1 shows audio-video is possible; full sensory integration remains frontier research

The trajectory is clear: from text understanding to world understanding. As these systems improve, the gap between AI simulation and physical reality will continue to narrow.


World model capabilities, APIs, and pricing evolve rapidly. Verify current specifications directly with vendors before integration decisions.