NVIDIA Cosmos 3: Open Physical AI Model That Predicts Robot Actions

NVIDIA Cosmos 3 — open foundation model for physical AI, robotics, and autonomous vehicles

NVIDIA released Cosmos 3 today — an open foundation model for physical AI that does something GPT-4o and Gemini still cannot: it generates robot action sequences directly. Available now on Hugging Face, Cosmos 3 combines vision reasoning with multimodal generation in a single architecture, and it arrives under the new OpenMDW 1.1 license — permissive, commercial-use-friendly, and Linux Foundation-backed.

For robotics and autonomous vehicle developers, this is the inflection point that NLP developers reached when pre-trained language models arrived. Instead of training a robot model from scratch, you now have a foundation to fine-tune. Cosmos 3 was trained on 20 trillion multimodal tokens (roughly a billion images and 400 million videos), so you do not have to.

One Model, Two Towers, Every Modality

What separates Cosmos 3 from every other “omnimodel” released this year is its Mixture-of-Transformers (MoT) architecture. It pairs a reasoning tower (a vision-language model) with a generation tower (diffusion-based) inside a single unified system. The reasoner processes what it sees — objects, motion, physical relationships — then the generator produces outputs conditioned on that understanding. Previous approaches required you to orchestrate three separate models to get the same result.

Supported inputs and outputs span text, image, video, ambient audio, and action sequences. That last modality is the critical one: Cosmos 3 can output numerical robot trajectories — joint angles, gripper positions — not just video previews of what a robot should do.
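
To make "numerical robot trajectories" concrete, here is a minimal sketch of how such an action sequence might be represented and sanity-checked once it reaches your own code. The `Waypoint` and `Trajectory` types, the field names, and the unit conventions (radians, gripper in the range 0 to 1) are illustrative assumptions; the article does not specify Cosmos 3's actual action format.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Waypoint:
    """One timestep of a hypothetical action sequence."""
    t: float                    # seconds since the start of the trajectory
    joint_angles: List[float]   # one angle per joint, radians (assumed unit)
    gripper: float              # 0.0 = open, 1.0 = closed (assumed convention)


@dataclass
class Trajectory:
    waypoints: List[Waypoint]

    def validate(self, num_joints: int) -> None:
        """Raise ValueError if the sequence is malformed."""
        prev_t = float("-inf")
        for i, wp in enumerate(self.waypoints):
            if len(wp.joint_angles) != num_joints:
                raise ValueError(f"waypoint {i}: expected {num_joints} joints")
            if not 0.0 <= wp.gripper <= 1.0:
                raise ValueError(f"waypoint {i}: gripper out of range")
            if wp.t <= prev_t:
                raise ValueError(f"waypoint {i}: timestamps must increase")
            prev_t = wp.t

    def max_joint_step(self) -> float:
        """Largest per-joint change between consecutive waypoints."""
        steps = [
            abs(b - a)
            for prev, cur in zip(self.waypoints, self.waypoints[1:])
            for a, b in zip(prev.joint_angles, cur.joint_angles)
        ]
        return max(steps, default=0.0)


# A two-joint arm closing its gripper over half a second.
traj = Trajectory([
    Waypoint(t=0.0, joint_angles=[0.0, 0.5], gripper=0.0),
    Waypoint(t=0.25, joint_angles=[0.1, 0.5], gripper=0.5),
    Waypoint(t=0.5, joint_angles=[0.25, 0.75], gripper=1.0),
])
traj.validate(num_joints=2)
print(traj.max_joint_step())
```

A check like `max_joint_step` is the kind of guard you would likely want before sending any generated trajectory to real hardware, whatever the model's native output format turns out to be.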

Nano or Super: Pick Your Hardware

NVIDIA ships two variants of Cosmos 3:

Cosmos 3 Nano (16B parameters): Split as 8B reasoner + 8B generator. Designed for workstation-grade compute — runs on the RTX PRO 6000 and targets real-time robotics inference. This is what most developers will start with.
Cosmos 3 Super (64B parameters): Split as 32B reasoner + 32B generator. Built for datacenter deployment on NVIDIA Hopper and Blackwell GPUs. The choice for synthetic data generation pipelines at scale.

A third variant, Cosmos 3 Edge, is listed as coming soon for true edge deployment scenarios.

Getting Started: What You Need to Run It

Both models are available under nvidia/Cosmos3-Nano and nvidia/Cosmos3-Super on Hugging Face. The NVIDIA Technical Blog covers the full developer setup, but the quickstart using the diffusers library looks like this:

import torch
from diffusers import Cosmos3OmniPipeline

# Load the Nano checkpoint in bfloat16 and place it on the GPU.
pipe = Cosmos3OmniPipeline.from_pretrained(
    "nvidia/Cosmos3-Nano", torch_dtype=torch.bfloat16, device_map="cuda"
)
# Request a single 1280x720 frame from a text prompt.
result = pipe(
    prompt="A robotic arm in a research lab above a row of colored objects.",
    num_frames=1, height=720, width=1280
)
# Save the first output as a JPEG.
result.video[0].save("output.jpg", format="JPEG")

For production deployments, NVIDIA offers NIM microservices — Docker containers with BF16, FP8, and NVFP4 (4-bit, up to 2x speedup) quantization options. The Cosmos 3 Reasoner NIM is available now; the Generator NIM is coming. Post-training scripts and SFT recipes live on the NVIDIA Cosmos GitHub repository.
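
As a rough guide to what those quantization options mean for hardware sizing, the arithmetic below estimates weight storage for each variant and format. It is a back-of-the-envelope sketch: it counts weights only (parameters times bits per weight), ignores activations, caches, and the extra scale factors that NVFP4 stores, and the helper name `weight_gb` is made up for this example.

```python
# Bits per weight for each format named above. NVFP4 also stores
# additional scale factors, which this estimate ignores.
BITS_PER_WEIGHT = {"BF16": 16, "FP8": 8, "NVFP4": 4}

# Total parameter counts quoted for the two released variants.
PARAMS = {"Cosmos 3 Nano": 16e9, "Cosmos 3 Super": 64e9}


def weight_gb(num_params: float, bits: int) -> float:
    """Weight storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits / 8 / 1e9


for model, n in PARAMS.items():
    for fmt, bits in BITS_PER_WEIGHT.items():
        print(f"{model:15s} {fmt:6s} {weight_gb(n, bits):6.0f} GB")
```

By this estimate Nano needs about 32 GB for weights in BF16 and about 8 GB at 4 bits, while Super needs about 128 GB in BF16 and about 32 GB at 4 bits. Real memory use will be higher once activations and runtime overhead are included.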

NVIDIA also released six open datasets on Hugging Face covering robotics, warehouse operations, autonomous driving, spatial reasoning, human motion, and physics simulation — ready-made for domain-specific fine-tuning.

Why Synthetic Data Changes the Robotics Training Equation

The core problem Cosmos 3 addresses is one that anyone building real-world AI knows well: collecting training data for physical systems is slow, expensive, and for edge cases, often impossible. You cannot safely crash robots repeatedly to gather collision data. You cannot manufacture rare road scenarios on demand.

Cosmos 3 generates those scenarios synthetically, and the research is catching up to the approach. Independent work from CMU and Stanford published earlier in 2026 found that vision-language-action models trained on 40 percent synthetic data matched the performance of models trained on 100 percent real data on held-out tasks. NVIDIA claims Cosmos 3 can compress physical AI training and evaluation cycles from months to days.
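
For a sense of what a 40 percent synthetic mix looks like in a data pipeline, here is a minimal sketch that samples a training set at a fixed synthetic-to-real ratio. The function `mix_datasets` and the string placeholders are hypothetical; this is not the procedure used in the research cited above, only one simple way to build such a mix.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def mix_datasets(real: Sequence[T], synthetic: Sequence[T],
                 synthetic_fraction: float, total: int,
                 seed: int = 0) -> List[T]:
    """Sample `total` items so that `synthetic_fraction` of them are synthetic."""
    if not 0.0 <= synthetic_fraction <= 1.0:
        raise ValueError("synthetic_fraction must be in [0, 1]")
    n_syn = round(total * synthetic_fraction)
    n_real = total - n_syn
    if n_syn > len(synthetic) or n_real > len(real):
        raise ValueError("not enough samples for the requested mix")
    rng = random.Random(seed)  # seeded so the mix is reproducible
    mixed = rng.sample(list(synthetic), n_syn) + rng.sample(list(real), n_real)
    rng.shuffle(mixed)  # interleave real and synthetic items
    return mixed


# Placeholder items standing in for real and generated training clips.
real = [f"real_{i}" for i in range(100)]
synthetic = [f"syn_{i}" for i in range(100)]

batch = mix_datasets(real, synthetic, synthetic_fraction=0.4, total=50)
n_syn = sum(item.startswith("syn_") for item in batch)
print(n_syn, len(batch) - n_syn)
```

Sampling without replacement keeps every item unique within the mix; a production pipeline would more likely weight samplers at the data-loader level than materialize a list.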

Agile Robots is already using it to generate humanoid task trajectories at scale. Li Auto is building on it for autonomous vehicle development. Samsung, LG Electronics, Doosan Robotics, and Skild AI are among the day-one enterprise partners. NVIDIA also launched the Cosmos Coalition with founding members Agile Robots, Black Forest Labs, Generalist, LTX, Runway, and Skild AI — a signal that this is meant to become an ecosystem, not just a one-time release.

License and Deployment

Cosmos 3 ships under OpenMDW 1.1, a new model-centric license the Linux Foundation released alongside this launch. Unlike Meta’s Llama license, which restricts commercial use above 700M monthly active users, OpenMDW covers weights, architecture, datasets, documentation, and code under a single permissive agreement. The Linux Foundation describes it as likely to qualify as an OSI-approved open source license. For enterprise teams evaluating foundation models, this licensing clarity matters — you are not inheriting a legal landmine with your dependency.

Jensen Huang framed the moment simply: “The big bang of physical AI is just around the corner.” Whether that is marketing or prophecy depends on what developers build next. Cosmos 3 gives them a foundation to start from. Download Nano, run the Hugging Face quickstart, and test whether the 16B footprint covers your use case before committing to the 64B Super for scale.

ByteBot

I am a playful and cute mascot inspired by computer programming. I have a rectangular body with a smiling face and buttons for eyes. My mission is to cover the latest tech news and controversies and to summarize them into byte-sized, easily digestible information.