NVIDIA Cosmos 3: Open Physical AI World Models for Robots and Autonomous Systems

NVIDIA Cosmos 3 is an open omnimodal world model suite for Physical AI, combining reasoning, generation, sound, video, and action modeling.

14 min read · Yash Thakker
NVIDIA · Cosmos 3 · Physical AI · World models · Robotics

NVIDIA Cosmos 3 is the new open model family inside NVIDIA's Cosmos platform for Physical AI: robots, autonomous vehicles, industrial video systems, simulation pipelines, and synthetic-data workflows. The public repository positions Cosmos 3 as an omnimodal world model that can reason over text and vision while also generating images, videos, sound, and action sequences.

The important shift is that Cosmos 3 is not just another video model. It exposes two runtime surfaces: Reasoner for understanding and planning, and Generator for world simulation, future prediction, sound/video generation, and action-conditioned rollouts. As of June 4, 2026, the GitHub repository shows roughly 8.7k stars, one launch release, and model access through the NVIDIA Cosmos 3 Hugging Face collection.

This post summarizes the public README, NVIDIA Cosmos page, and linked developer materials as of June 4, 2026. For the event context around Jensen Huang's broader NVIDIA announcements, read our NVIDIA Computex 2026 recap. Check the upstream repo before pinning install commands, benchmark claims, CUDA choices, or license decisions.


TL;DR

| Question | Short answer |
| --- | --- |
| What is it? | An open omnimodal world-model family for Physical AI, published under the NVIDIA/cosmos repo |
| Core surfaces | Reasoner for text output from text/vision; Generator for image, video, sound, and action outputs |
| Architecture | Unified Mixture-of-Transformers with autoregressive reasoning and diffusion-based multimodal generation |
| Models listed | Cosmos3-Nano 16B, Cosmos3-Super 64B, Super Text2Image 64B, Super Image2Video 64B, Nano Policy DROID 16B |
| Developer paths | Diffusers, Transformers, vLLM-Omni, vLLM, NIM, and Cosmos Framework |
| Main caveat | Outputs can still break physically; safety-critical use needs validation beyond model inference |

What Cosmos 3 is

Cosmos is NVIDIA's open platform of world models, datasets, and tools for building Physical AI. The broader platform includes Cosmos Framework, Cosmos Curator, and Cosmos Evaluator; Cosmos 3 is the newest model family inside that stack.

The NVIDIA Cosmos product page describes Cosmos 3 as an open Physical AI foundation model with native reasoning, world generation, and action generation built on Mixture-of-Transformers. The public README says the model family jointly processes and generates language, images, video, audio, and action sequences.

That makes Cosmos 3 easier to place if you compare it with adjacent model classes:

| Model type | Typical job | Cosmos 3 overlap |
| --- | --- | --- |
| Vision-language model | Understand images/video and answer questions | Reasoner surface |
| Video generator | Generate video from text or images | Generator surface |
| World simulator | Predict how scenes evolve | Generator future prediction and forward dynamics |
| Robot policy model | Predict or condition on actions | Action modeling and policy workflows |
| Synthetic-data engine | Create training data at scale | Video, sound, and action-conditioned outputs |

NVIDIA's framing is that Physical AI teams should not need one model for captioning, another for simulation, another for action prediction, and another for video generation. Cosmos 3 attempts to make these capabilities share one architectural backbone and one developer ecosystem.


Reasoner vs Generator

The cleanest way to understand the release is to separate the two runtime surfaces.

| Surface | Inputs | Outputs | Best fit |
| --- | --- | --- | --- |
| Reasoner | Text and vision | Text | Captioning, temporal localization, 2D grounding, embodied reasoning, physical plausibility, planning |
| Generator | Text, vision, sound, action | Vision, sound, action | Text-to-image, text-to-video, image-to-video, video-to-video, forward dynamics, policy rollouts |

Reasoner

Reasoner is the understanding path. It accepts text plus images or video and returns text. In the README examples, this covers detailed captioning, timestamped event localization, common-sense physical judgment, bounding-box grounding, describe-anything prompts, action chain-of-thought, driving-scene reasoning, and likely-next-action prediction.

The message format follows Qwen3-VL-compatible conventions. A basic request shape looks like this:

[
  {
    "role": "system",
    "content": [{ "type": "text", "text": "You are a helpful assistant." }]
  },
  {
    "role": "user",
    "content": [
      { "type": "video_url", "video_url": "https://example.com/video.mp4" },
      { "type": "text", "text": "List the notable events with approximate timestamps." }
    ]
  }
]

Reasoner is the better path when you want an answer, a plan, a classification, a JSON grounding result, or an explanation of visible physical context.
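
As a minimal sketch, the request shape above can be assembled in Python before it is handed to whichever client you use. The helper name `build_reasoner_messages` is hypothetical, and the field names simply mirror the example in this post; check them against the upstream schema before relying on them.

```python
import json


def build_reasoner_messages(video_url, question,
                            system_prompt="You are a helpful assistant."):
    """Return a chat message list shaped like the example request above."""
    if not video_url.startswith(("http://", "https://", "file://")):
        raise ValueError("video_url should be an absolute URL")
    return [
        {"role": "system", "content": [{"type": "text", "text": system_prompt}]},
        {
            "role": "user",
            "content": [
                # Field names copied from the post's example, not verified upstream.
                {"type": "video_url", "video_url": video_url},
                {"type": "text", "text": question},
            ],
        },
    ]


messages = build_reasoner_messages(
    "https://example.com/video.mp4",
    "List the notable events with approximate timestamps.",
)
print(json.dumps(messages, indent=2))
```

Keeping the payload in a small function makes it easy to swap the question while the system prompt and media reference stay fixed.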

Generator

Generator is the world-production path. It accepts text, vision, sound, and action conditioning, then produces non-text outputs: images, videos, synchronized sound, and action states.

The README examples include:

| Workflow | Inputs | Outputs |
| --- | --- | --- |
| Text-to-image | Text | Vision |
| Text-to-video | Text | Vision |
| Text-to-video with sound | Text | Vision and sound |
| Image-to-video | Text and image | Vision |
| Video-to-video | Text and video | Vision |
| Forward dynamics | Text, vision, action | Future visual state |
| Action policy | Text and vision | Action and rollout video |

The distinction matters operationally. If you are building a video analytics agent, Reasoner is the starting point. If you are generating synthetic robot training clips or predicting future observations from an action trace, Generator is the starting point.
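
One way to encode that routing decision is a lookup keyed on the outputs you need. This is only an illustrative sketch: `pick_surface` and the modality labels are hypothetical names, and the mapping restates the Reasoner and Generator output columns from this post.

```python
REASONER = "reasoner"
GENERATOR = "generator"

# Output modalities per surface, as summarized in this post.
SURFACE_OUTPUTS = {
    REASONER: {"text"},
    GENERATOR: {"vision", "sound", "action"},
}


def pick_surface(wanted_outputs):
    """Return the single surface that can produce every requested output."""
    wanted = set(wanted_outputs)
    if not wanted:
        raise ValueError("request at least one output modality")
    for name, outputs in SURFACE_OUTPUTS.items():
        if wanted <= outputs:
            return name
    # For example, text plus video needs a Reasoner call and a Generator call.
    raise ValueError(f"no single surface produces {sorted(wanted)}")


print(pick_surface({"text"}))
print(pick_surface({"vision", "sound"}))
```

A request that mixes text with generated media raises an error here on purpose, because it has to be split into two calls.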


Model family

The release README lists five primary model entries:

| Model | Size | Primary capability |
| --- | --- | --- |
| Cosmos3-Nano | 16B | Compact omnimodal model for multimodal understanding, simulation, future prediction, action reasoning, and Physical AI |
| Cosmos3-Super | 64B | Larger omnimodal model for advanced understanding, simulation, future prediction, and action reasoning |
| Cosmos3-Super-Text2Image | 64B | High-fidelity text-to-image generation |
| Cosmos3-Super-Image2Video | 64B | Temporally coherent image-to-video generation |
| Cosmos3-Nano-Policy-DROID | 16B | Vision-language robot policy for DROID manipulation and control |

This model list is worth checking against older summaries. Some earlier coverage, including our NVIDIA Computex event recap, cited smaller Nano/Super parameter counts that circulated around the keynote. The public GitHub README now lists 16B and 64B entries for the launch artifacts, so treat the repository as the canonical reference.


Architecture in plain English

Cosmos 3 uses a unified Mixture-of-Transformers architecture with two jobs inside one model family:

  1. Autoregressive reasoning for language and visual understanding.
  2. Diffusion-based generation for images, video, audio, and action tokens.

In Reasoner mode, the model processes language and visual tokens through causal self-attention, similar to how a multimodal language model predicts the next text token. In Generator mode, noisy multimodal tokens are denoised through full attention, which is closer to the diffusion path used in modern image and video generators.
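
The difference between the two attention patterns can be shown with plain boolean masks. This is a generic illustration of causal versus full attention, not code from Cosmos 3; the function names are made up for the example.

```python
def causal_mask(n):
    """mask[i][j] is True when token i may attend to token j (only j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]


def full_mask(n):
    """Every token may attend to every other token, as in a denoising pass."""
    return [[True] * n for _ in range(n)]


def render(mask):
    """Draw a mask as rows of 'x' (allowed) and '.' (blocked)."""
    return "\n".join("".join("x" if ok else "." for ok in row) for row in mask)


print(render(causal_mask(4)))
print()
print(render(full_mask(4)))
```

The causal mask is lower-triangular, which is what lets next-token prediction work; the full mask lets every noisy token see the whole sequence at each denoising step.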

Both modes share the same high-level transformer architecture, multimodal attention layers, and a 3D multidimensional rotary position embedding representation. The 3D positional design matters because world models need to represent not only what appears in a frame, but also where it is and how it changes over time.
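
To make the "3D" part concrete, here is a small sketch that assigns a (time, row, column) index to each video patch token. The frame-major ordering and the function name are assumptions for illustration; how Cosmos 3 turns such indices into rotary embeddings is not shown here.

```python
def position_ids_3d(frames, height, width):
    """Return one (t, y, x) index per patch token, in frame-major order."""
    return [
        (t, y, x)
        for t in range(frames)
        for y in range(height)
        for x in range(width)
    ]


ids = position_ids_3d(frames=2, height=2, width=3)
print(len(ids))        # 2 * 2 * 3 tokens
print(ids[2], ids[3])  # neighbours in the flat sequence, not in the image
```

Tokens 2 and 3 sit next to each other in the flattened sequence, yet their indices are (0, 0, 2) and (0, 1, 0): opposite ends of adjacent rows. A single 1D position would hide that, which is the motivation for separate time, height, and width axes.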

For a robotics team, that means Cosmos 3 is trying to keep perception, temporal prediction, and action-conditioned generation in the same representational space instead of stitching together separate systems after the fact.


Inputs, outputs, and generation settings

Cosmos 3 supports a broad I/O surface, but the defaults are still concrete enough to plan around.

| Area | Public README detail |
| --- | --- |
| Input types | Text, text + image, text + video, text + image + action |
| Input formats | Text string, JPG/PNG/JPEG/WEBP images, MP4 video, JSON action arrays |
| Output types | Image, video, sound, action state, text |
| Output formats | JPG image, MP4 video, AAC sound muxed into MP4, JSON action values, text |
| Resolution tiers | 256p, 480p, 720p; default 480p |
| Aspect ratios | 16:9, 4:3, 1:1, 3:4, 9:16; default 16:9 |
| Frame rates | 10, 16, 24, 30 FPS; default 24 FPS |
| Frame count | 5 to 300 frames; default 189 |
| Prompt guidance | Fewer than 300 words is recommended for world-generation prompts |
| Sound output | Stereo AAC at 48 kHz when generated with video |
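
Those limits are easy to check before a job is queued. The sketch below validates a settings object against the values listed above and derives clip duration; `GenerationSettings` and its field names are hypothetical and are not the argument names of any Cosmos 3 pipeline.

```python
from dataclasses import dataclass

# Allowed values copied from the README summary in this post.
RESOLUTIONS = ("256p", "480p", "720p")
ASPECT_RATIOS = ("16:9", "4:3", "1:1", "3:4", "9:16")
FRAME_RATES = (10, 16, 24, 30)
MIN_FRAMES, MAX_FRAMES = 5, 300


@dataclass(frozen=True)
class GenerationSettings:
    # Hypothetical field names; the defaults are the documented ones.
    resolution: str = "480p"
    aspect_ratio: str = "16:9"
    fps: int = 24
    num_frames: int = 189

    def problems(self):
        """Return a list of human-readable violations (empty when valid)."""
        found = []
        if self.resolution not in RESOLUTIONS:
            found.append(f"resolution {self.resolution!r} not in {RESOLUTIONS}")
        if self.aspect_ratio not in ASPECT_RATIOS:
            found.append(f"aspect ratio {self.aspect_ratio!r} not in {ASPECT_RATIOS}")
        if self.fps not in FRAME_RATES:
            found.append(f"fps {self.fps} not in {FRAME_RATES}")
        if not MIN_FRAMES <= self.num_frames <= MAX_FRAMES:
            found.append(f"num_frames {self.num_frames} outside {MIN_FRAMES}..{MAX_FRAMES}")
        return found

    @property
    def duration_seconds(self):
        """Clip length implied by frame count and frame rate."""
        return self.num_frames / self.fps


defaults = GenerationSettings()
print(defaults.problems(), defaults.duration_seconds)
```

With the defaults, 189 frames at 24 FPS works out to a clip of 7.875 seconds, which is a useful number to keep in mind when estimating generation time and storage.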

Action conditioning is where Cosmos 3 becomes more specialized than a general video model. The README lists support for action dimensions across camera motion, autonomous vehicles, egocentric motion, single-arm robots, dual-arm robot settings, and humanoid robots. That is the part Physical AI teams should inspect most closely, because action dimensionality and embodiment assumptions determine whether a demo maps to a real control pipeline.


How to get started

Before running examples, the README asks developers to create a Hugging Face token and authenticate locally:

uvx hf@latest auth login

From there, choose the integration based on the job.

| Goal | Use | Notes |
| --- | --- | --- |
| Generator research | Diffusers | Python-first path for inspecting generation behavior |
| Generator production serving | vLLM-Omni | OpenAI-compatible API for image, video, sound, and action outputs |
| Reasoner research | Transformers | Listed as coming soon in the README |
| Reasoner production serving | vLLM | OpenAI-compatible endpoint for text outputs from text and vision inputs |
| Turnkey Reasoner deployment | NIM | Prebuilt optimized container |
| Training and evaluation | Cosmos Framework | Full workflow docs for inference, training, and evaluation |

Diffusers path

The Diffusers path is aimed at Generator research and model development. The README installs the latest Diffusers from GitHub alongside acceleration and media dependencies:

uv venv --python 3.13 --seed --managed-python
source .venv/bin/activate
uv pip install --torch-backend=auto \
  "diffusers @ git+https://github.com/huggingface/diffusers.git" \
  accelerate \
  av \
  cosmos_guardrail \
  huggingface_hub \
  imageio \
  imageio-ffmpeg \
  torch \
  torchvision \
  transformers

The important operational note: --torch-backend=auto is there to match your installed NVIDIA driver with a compatible CUDA wheel. If you force a newer CUDA wheel than your driver supports, torch.cuda.is_available() can return False even though the machine has a GPU.

vLLM-Omni path

For Generator serving, the README points to vLLM-Omni. The official Docker image is the practical path while full upstream support continues landing:

docker run --runtime nvidia --gpus all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -v "$(pwd):/workspace" \
  -p 8000:8000 \
  --ipc=host \
  vllm/vllm-omni:cosmos3 \
  vllm serve nvidia/Cosmos3-Nano \
  --omni \
  --model-class-name Cosmos3OmniDiffusersPipeline \
  --allowed-local-media-path / \
  --port 8000 \
  --init-timeout 1800

The long init timeout is not cosmetic. Large checkpoints can exceed default server startup limits, so the README recommends --init-timeout 1800.
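
On the client side, the same patience is needed before sending the first request. Below is a minimal polling sketch with a matching 1800-second budget. It assumes the container exposes a `/health` route on port 8000, as upstream vLLM does; that has not been confirmed for the Cosmos 3 image, and both helper names are hypothetical.

```python
import http.client
import time
import urllib.request


def wait_until_ready(probe, timeout_s=1800.0, interval_s=5.0,
                     clock=time.monotonic, sleep=time.sleep):
    """Call probe() until it returns True or timeout_s elapses."""
    deadline = clock() + timeout_s
    while True:
        if probe():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)


def http_probe(url, timeout_s=2.0):
    """Return True when a GET on url answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as response:
            return response.status == 200
    except (OSError, http.client.HTTPException):
        # Refused connections, timeouts, and HTTP errors all mean "not ready yet".
        return False


# Example (assumed URL, not run here):
# ready = wait_until_ready(lambda: http_probe("http://localhost:8000/health"))
```

The probe is passed in as a function so the wait loop can be tested without a running server.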

Reasoner serving

For Reasoner production inference, use vLLM behind an OpenAI-compatible chat-completions API. For teams that do not want to manage vLLM and CUDA setup directly, the README also documents a Reasoner path through NVIDIA NIM.
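
A rough sketch of what a call to such an endpoint looks like with only the standard library. The `/v1/chat/completions` path is the usual OpenAI-compatible route; the base URL, the `max_tokens` value, and the use of `nvidia/Cosmos3-Nano` as the served model name are assumptions for illustration, and the request is built but deliberately not sent.

```python
import json
import urllib.request


def build_chat_request(messages, model, base_url="http://localhost:8000",
                       max_tokens=512):
    """Build (but do not send) a POST to an OpenAI-compatible chat endpoint."""
    body = {"model": model, "messages": messages, "max_tokens": max_tokens}
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_chat_request(
    messages=[{"role": "user",
               "content": [{"type": "text", "text": "Describe the scene."}]}],
    model="nvidia/Cosmos3-Nano",  # assumed served model name
)
print(request.get_method(), request.full_url)

# Sending is left to the caller, for example:
# with urllib.request.urlopen(request, timeout=60) as response:
#     reply = json.load(response)
```

Any OpenAI-compatible client library should work equally well; the standard-library version just makes the wire format explicit.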


CUDA and container constraints

Cosmos 3 is not a laptop toy unless that laptop is effectively a serious NVIDIA workstation. The README lists:

  • Operating system: Linux
  • Precision: BF16 tested
  • GPU architectures: NVIDIA Ampere, Hopper, and Blackwell
  • CUDA: CUDA 13 recommended, CUDA 12.8 supported
  • Base containers: NGC PyTorch 25.09-py3 for CUDA 13 or 25.06-py3 for CUDA 12

The most common setup trap is a mismatch between system CUDA, driver support, PyTorch's CUDA build, and the uv torch backend. If torch.cuda.is_available() returns False, do not assume Cosmos is broken. Check the driver, check nvidia-smi, check torch.version.cuda, and install a matching torch backend.
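
Those checks can be collected into one small report. This is a sketch rather than an official diagnostic: `cuda_report` is a hypothetical helper, and it degrades gracefully when `nvidia-smi` or PyTorch is missing.

```python
import shutil
import subprocess


def cuda_report():
    """Collect the facts needed to debug a CUDA and PyTorch mismatch."""
    report = {
        "nvidia_smi": None,        # GPU name and driver version, if available
        "torch_installed": False,
        "torch_cuda_build": None,  # CUDA version the wheel was built against
        "cuda_available": False,
    }
    smi = shutil.which("nvidia-smi")
    if smi:
        try:
            result = subprocess.run(
                [smi, "--query-gpu=name,driver_version", "--format=csv,noheader"],
                capture_output=True, text=True, timeout=10, check=False,
            )
            report["nvidia_smi"] = result.stdout.strip() or result.stderr.strip()
        except (OSError, subprocess.SubprocessError) as error:
            report["nvidia_smi"] = f"failed: {error}"
    try:
        import torch
    except ImportError:
        return report
    report["torch_installed"] = True
    report["torch_cuda_build"] = torch.version.cuda  # None on CPU-only builds
    report["cuda_available"] = torch.cuda.is_available()
    return report


for key, value in cuda_report().items():
    print(f"{key}: {value}")
```

If `nvidia_smi` shows a driver but `cuda_available` is False, compare `torch_cuda_build` with what the driver supports before looking anywhere else.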

The README also calls out minimal container failures such as missing libxcb.so.1 or libgl1. On headless servers, install the system graphics packages before blaming model code:

apt-get update && apt-get install -y libxcb1 libgl1 libglib2.0-0

Benchmarks and what to read

NVIDIA keeps Cosmos 3 serving and generation benchmarks in inference_benchmarks.md. The README says those tables cover:

| Benchmark area | Surface | What it measures |
| --- | --- | --- |
| Cosmos3-Nano generator | Generator | Text-to-image, text-to-video, and image-to-video latency across PyTorch, vLLM-Omni, and Diffusers |
| Cosmos3-Super generator | Generator | The same generation modalities at larger checkpoint scale |
| Cosmos3-Nano reasoner | Reasoner | vLLM serving metrics such as time to first token, request latency, and throughput under concurrency |

Use those numbers as engineering inputs, not marketing conclusions. For deployment planning, the real questions are:

  • Which exact checkpoint?
  • Which resolution and frame count?
  • Which GPU and tensor-parallel setup?
  • Which serving stack?
  • Is the benchmark measuring first-token latency, full request latency, diffusion generation time, or throughput?

World-model benchmarks are especially easy to misread because "video generation latency" and "chat-completion latency" are not comparable workloads.
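
To keep the serving metrics apart when reading or reproducing a table, it helps to compute them from the same three timestamps. The sketch below uses hypothetical field names and simple definitions; they are not necessarily the definitions used in inference_benchmarks.md.

```python
def request_metrics(sent_at, first_token_at, finished_at, output_tokens):
    """Split one streamed request into metrics that are often conflated."""
    if not (sent_at <= first_token_at <= finished_at) or finished_at <= sent_at:
        raise ValueError("timestamps must be ordered and span a positive duration")
    if output_tokens < 1:
        raise ValueError("output_tokens must be at least 1")
    latency = finished_at - sent_at
    return {
        "time_to_first_token_s": first_token_at - sent_at,
        "request_latency_s": latency,
        "output_tokens_per_s": output_tokens / latency,
    }


metrics = request_metrics(sent_at=0.0, first_token_at=0.5,
                          finished_at=4.0, output_tokens=200)
print(metrics)
```

In this made-up request the first token arrives after 0.5 seconds, the full answer after 4.0 seconds, and throughput is 50 tokens per second. A diffusion job has no first token at all, which is one reason generation and chat numbers should not share a column.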


Use cases that actually fit

Cosmos 3 is most interesting where teams need models that understand or simulate physical state, not just produce attractive clips.

Robot learning

Robot teams can use Cosmos 3 for visual reasoning, task planning, next-action prediction, action-conditioned rollouts, and policy-model development. The Cosmos3-Nano-Policy-DROID entry is a direct signal that NVIDIA is targeting manipulation and control, not only video demos.

The hard part is still embodiment. A robot policy is not portable just because two tasks both involve "a robot arm." Camera layout, gripper type, action space, environment distribution, and safety constraints all matter.

Autonomous vehicle training

Cosmos 3 can generate future rollouts and synthetic data from visual and action context. That is useful for weather diversity, lighting variation, rare events, and policy stress tests.

The failure mode is over-trusting plausible video. A clip can look physically reasonable while still violating sensor geometry, road-agent behavior, or downstream planner assumptions. AV use needs evaluation against simulator constraints, real logs, and safety cases.

Industrial video agents

Reasoner can support dense captioning, situation understanding, physical plausibility analysis, and temporal localization across factory, warehouse, logistics, traffic, and inspection footage.

For this use case, Cosmos 3 sits near NVIDIA's existing video analytics work. It may become a stronger reasoning and synthetic-data component inside broader video search, alerting, and summarization systems.

Synthetic data generation

The Generator path can produce images, video, synchronized sound, and action-conditioned future states. That makes Cosmos 3 relevant when real-world data is expensive, rare, dangerous, private, or hard to label.

Synthetic data still needs measurement. Teams should track whether generated data improves target-task performance, where it introduces bias, and whether rare-event generation creates believable but wrong edge cases.
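
A simple way to start measuring is a per-slice comparison of the target-task metric with and without synthetic data. The helper names and the slice labels below are hypothetical, and the numbers are placeholders chosen only to exercise the code; they are not results from any experiment.

```python
def lift_by_slice(baseline, augmented):
    """Per-slice change in a target-task metric after adding synthetic data."""
    mismatched = set(baseline) ^ set(augmented)
    if mismatched:
        raise ValueError(f"slices differ between runs: {sorted(mismatched)}")
    return {name: augmented[name] - baseline[name] for name in baseline}


def regressions(lift, tolerance=0):
    """Slices where the metric dropped by more than the tolerance."""
    return sorted(name for name, delta in lift.items() if delta < -tolerance)


# Placeholder success counts out of 100 episodes, for illustration only.
baseline = {"daylight": 82, "rain": 61, "night": 70}
augmented = {"daylight": 81, "rain": 69, "night": 70}

lift = lift_by_slice(baseline, augmented)
print(lift)
print(regressions(lift))
```

Reporting the change per slice rather than as one average makes it visible when synthetic data helps a rare condition while quietly hurting a common one.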


Cosmos 3 vs other world-model approaches

The world-model landscape is splitting into several shapes:

| Approach | Example | Output style | Best for |
| --- | --- | --- | --- |
| Omnimodal Physical AI model | Cosmos 3 | Text, image, video, sound, action | Robotics, AV, physical reasoning, synthetic data |
| Persistent 3D world generation | Tencent HY-World 2.0 | 3DGS, meshes, point clouds | Editable worlds and engine import |
| Interactive playable worlds | Google Genie-style systems | Video or playable scene rollouts | Agent training and game-like interaction |
| Real-time audiovisual world models | Odyssey Starchild-style systems | Streaming audio-video | Interactive media and multimodal environments |
| Video understanding models | VLMs and video agents | Text or structured outputs | Search, captioning, safety, monitoring |

Cosmos 3's differentiator is breadth across reasoning, generation, and action. Persistent 3D systems may be better when you need editable assets. Pure VLMs may be cheaper and simpler when you only need answers from video. Video generators may be more accessible when the goal is creative content rather than physical prediction.


Limitations

The README is explicit that Cosmos 3 can still produce artifacts in long, high-resolution, or physically complex outputs. Listed failure modes include:

  • Temporal inconsistency
  • Unstable camera or object motion
  • Inaccurate sound-video alignment
  • Imperfect action-state consistency
  • Object morphing
  • Inaccurate 3D structure
  • Implausible physical dynamics

Those are not minor caveats for Physical AI. They are the boundary between a useful research system and a deployable control system.

For safety-critical robotics, autonomous driving, industrial automation, or multi-agent behavior, Cosmos 3 should be treated as one component inside a validated pipeline. You still need simulation checks, real-world tests, policy constraints, monitoring, fallback behavior, and license review.


Bottom line

Cosmos 3 is NVIDIA's most concrete open attempt to make Physical AI development feel like a unified model stack: reason over video, generate plausible futures, condition on actions, serve through OpenAI-compatible APIs, and train or evaluate through the Cosmos ecosystem.

The release is strongest for teams that already understand GPU infrastructure, simulation, robotics data, or video analytics. For everyone else, the right first step is not "deploy a robot." It is to pick one bounded workflow: caption a video, localize an event, generate a short action-conditioned rollout, or benchmark a text-to-video path on a known GPU.

Status note: repository stars, model listings, CUDA guidance, and vLLM-Omni compatibility were checked against public NVIDIA materials on June 4, 2026. Verify upstream links before using this for procurement, benchmark claims, or production architecture.
