Nvidia's New AI Model Teaches Robots to See and Act, Not Just Look

Nvidia is stepping deeper into artificial intelligence infrastructure with Cosmos 3, an open world model built specifically to help robots, autonomous vehicles and other physical systems understand and navigate real environments.

The company trained Cosmos 3 on 20 trillion tokens of multimodal data: nearly a billion images, 400 million real and synthetic videos, ambient audio, text and, crucially, action data from humans and robots. That last ingredient separates it from ordinary video generation tools.

The action data is designed to model how machines actually move and behave, not just how scenes look on screen. Developers can use Cosmos to simulate actions in physical spaces, then build specialized models for their robots and machines on top of that foundation.

Ming-Yu Liu, vice president of Nvidia's Cosmos Lab, emphasized that autonomous action modeling is the key difference. The system generates specific action data like robot joint angles, gripper positions and movement trajectories that can train machines to manipulate and navigate the physical world.
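To make the idea of action data concrete, here is a minimal sketch of what a single record of joint angles, gripper position and trajectory might look like. The class name, field names and units are assumptions chosen for illustration; they do not reflect Cosmos's actual output format, which the article does not describe.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class ActionStep:
    """One timestep of a hypothetical robot action record (illustrative only)."""
    time_s: float                             # seconds since the start of the episode
    joint_angles_rad: List[float]             # one angle per joint, in radians
    gripper_opening: float                    # assumed scale: 0.0 closed, 1.0 fully open
    position_xyz: Tuple[float, float, float]  # end-effector position in meters


def path_length(steps: List[ActionStep]) -> float:
    """Sum the straight-line distances between consecutive positions."""
    total = 0.0
    for current, following in zip(steps, steps[1:]):
        total += math.dist(current.position_xyz, following.position_xyz)
    return total


# A three-step trajectory: move across the table, then lift while closing the gripper.
trajectory = [
    ActionStep(0.0, [0.0, 0.0], 1.0, (0.0, 0.0, 0.0)),
    ActionStep(0.5, [0.3, 0.1], 1.0, (3.0, 4.0, 0.0)),
    ActionStep(1.0, [0.3, 0.6], 0.0, (3.0, 4.0, 12.0)),
]
print(path_length(trajectory))
```

A sequence of such records is the kind of signal a downstream model could be trained on, as opposed to video frames alone.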

Nvidia is releasing two versions immediately: a high-accuracy "super" model for applications like robot training and autonomous vehicles, and a "nano" model that produces results in fractions of a second. An "edge" model designed to run locally on devices is coming soon.

The company is positioning Cosmos as an open model, similar to its earlier Nemotron family. That approach lets hardware manufacturers customize the system to their needs and lets industry demands shape future versions. Nvidia is backing the effort with an initial coalition that includes Agile Robots, Black Forest Labs and Runway.

One major advantage: Cosmos can generate rare or dangerous scenarios, such as robot collisions or unusual road events, that would be difficult, expensive or unsafe to capture repeatedly in real life.

The move reflects a broader industry shift toward world models as a critical frontier in AI. Companies want to take the capabilities of chatbots and agents and enable them to perform actual physical tasks. Competitors like World Labs and AMI Labs are working in this space as well.

Nvidia's strategy extends beyond chips into AI models and software. The company is betting that the next wave of AI systems will not simply answer questions or generate images, but will need to predict, simulate and act in physical environments. By making Cosmos open and developer-friendly, Nvidia wants its platform to become foundational infrastructure for that emerging wave.

Author James Rodriguez: "Nvidia is making a smart play by turning world models into a commodity tool for developers instead of keeping them locked down, but whether robots actually learn to navigate better from synthetic action data is still the real test."
