

The next major shift in AI is happening outside the browser and beyond language. Autonomous systems now require a new class of models that do not process words but understand the physical world itself. These are called World Models, or World Foundation Models (WFMs). If large language models predict the next word in a sentence, world models predict the next event in a physical environment. They anticipate how a pedestrian might move, how vehicles might interact, or how a full traffic scene evolves.
This matters because physical autonomy cannot rely on text or static images. It needs spatial reasoning and an internal sense of physics. It needs models that can imagine thousands of futures, generate data that is impossible to capture manually, and train driving policies long before a vehicle touches real roads.
In the last two years, the world model space has accelerated faster than any other part of Physical AI development. Research groups, automotive suppliers, simulation platforms, and end-to-end autonomy companies have begun building these models in public.
World Models are generative AI systems built to understand and predict the behavior of the physical world. They learn directly from video, multi-camera sequences, or sensor data and produce coherent future scenes. A world model can take a driving clip from Los Angeles, absorb the context, and continue the scene with realistic motion, traffic interactions, weather conditions, and dynamic agents.
The simplest way to explain them: where an LLM predicts the next token in a sentence, a world model predicts the next moment in a physical scene.
This gives autonomy something it has never had: the ability to train on “dream” worlds at scale. A world model can generate thousands of scenarios that rarely occur in the real world. It can create endless variations of a single driving moment, changing agents, timing, lighting, and weather while keeping the scene physically consistent. This is the new backbone for training, testing, and validating end-to-end driving systems.
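To make the pattern concrete, here is a minimal Python sketch of the interface a world model exposes: a context clip goes in, controlled variations of its future come out. Every class, method, and parameter name here is a placeholder invented for illustration, not any vendor's actual API.

```python
from dataclasses import dataclass
from typing import List

import numpy as np


@dataclass
class SceneConditions:
    """Controllable knobs for one generated scenario (illustrative only)."""
    weather: str = "clear"      # e.g. "rain", "fog", "snow"
    time_of_day: str = "day"    # e.g. "dusk", "night"
    extra_agents: int = 0       # pedestrians or vehicles injected into the scene


class ToyWorldModel:
    """Hypothetical world model wrapper: context clip in, imagined futures out."""

    def encode(self, context_frames: np.ndarray) -> np.ndarray:
        # Placeholder: a real model would compress the clip into a learned latent.
        return context_frames.mean(axis=0)

    def rollout(self, latent: np.ndarray, conditions: SceneConditions,
                horizon: int) -> np.ndarray:
        # Placeholder: a real model would generate physically consistent frames
        # that respect the requested weather, lighting, and agent changes.
        rng = np.random.default_rng(hash(conditions.weather) % 2**32)
        return rng.random((horizon, *latent.shape), dtype=np.float32)


def generate_variations(model: ToyWorldModel, clip: np.ndarray,
                        horizon: int = 120) -> List[np.ndarray]:
    """Turn one real driving moment into many controlled variations of its future."""
    latent = model.encode(clip)
    futures = []
    for weather in ("clear", "rain", "fog"):
        for extra_agents in (0, 2, 5):
            cond = SceneConditions(weather=weather, extra_agents=extra_agents)
            futures.append(model.rollout(latent, cond, horizon))
    return futures
```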

Every autonomy team faces the same problem: lack of real-world data.
Real-world data is limited, expensive, and often repetitive. Edge cases may appear only once every tens of millions of miles, weather varies by region and season, and traffic patterns differ between cities. No company, not even the largest automaker, can collect every case needed to train a reliable driving stack.
World Models fill this gap. They allow engineers to generate synthetic data that is representative, diverse, and controllable. They allow simulation to move beyond game-engine logic and instead rely on learned physics and behavior. They allow closed-loop end-to-end training without risking real vehicles. Most importantly, they allow researchers to stress-test driving policies with situations that would be dangerous to recreate on public roads.
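The closed-loop idea can be pictured with a short sketch: the driving policy acts, the world model imagines the consequence, and the loop repeats without a real vehicle. The objects and method names below are placeholders for illustration, not a specific company's stack.

```python
def closed_loop_episode(world_model, policy, initial_latent, steps=200):
    """Roll a driving policy forward entirely inside a learned world model.

    Hypothetical interface, for illustration only:
      world_model.decode(latent)       -> rendered surround-view frame
      world_model.step(latent, action) -> next latent state
      policy.act(frame)                -> steering / throttle command
    """
    latent = initial_latent
    trajectory = []
    for _ in range(steps):
        frame = world_model.decode(latent)          # what the policy "sees"
        action = policy.act(frame)                  # policy decides as it would on-road
        latent = world_model.step(latent, action)   # the world model imagines the outcome
        trajectory.append((frame, action))
    # The finished trajectory can be scored for collisions, comfort, and progress,
    # and the policy updated, without ever putting a real vehicle at risk.
    return trajectory
```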
That is why investors are paying close attention. Vinod Khosla recently stated that “World models will be as important as LLMs,” calling them the next major market and highlighting how much the field will depend on strong, large-scale datasets. Even Tesla, the company best positioned on real-world fleet data, acknowledges that this is still a long road. Elon Musk claimed that the hardest part is not getting autonomy to work, but getting from “sort of works” to “much safer than a human,” and that this gap takes several years to close. His point was clear: the data and learning curve are long, and whoever builds the best training loop wins.
In short, world models are the first technology that lets autonomy scale without putting millions of cars on the road. Here is a quick overview of some of the companies contributing to world model research.
Wayve has been building toward world models longer than almost anyone in the space. When the company started in 2017, its work focused on front-facing imitation learning models. These early models predicted steering and speed from a single camera view and were impressive for their time, but limited in spatial awareness.
The major shift came with GAIA-2. Wayve expanded to a synchronized five-camera setup, allowing the model to understand how objects interact around the vehicle instead of only what appears ahead of it. GAIA-2 could generate new driving sequences, create realistic scene variations, and infer the dynamics of full traffic interactions.
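For intuition on what a synchronized multi-camera setup means in practice, the small sketch below shows the kind of tensor such a model consumes; the shapes and layout are illustrative assumptions, not GAIA-2's actual implementation.

```python
import numpy as np

# Illustrative shapes only, not Wayve's actual data layout.
NUM_CAMERAS = 5          # synchronized surround cameras
CLIP_LENGTH = 16         # frames per camera, captured at the same timestamps
HEIGHT, WIDTH = 288, 512

# One training sample: every camera observes the same instants, so the model can
# relate an object leaving one view to the same object entering a neighboring view.
clip = np.zeros((CLIP_LENGTH, NUM_CAMERAS, 3, HEIGHT, WIDTH), dtype=np.float32)

# A multi-camera world model consumes this whole block at once and learns
# cross-view consistency, instead of treating each camera as a separate video.
print(clip.shape)  # (16, 5, 3, 288, 512)
```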
GAIA-3 strengthens this direction. It improves temporal consistency, expands scene understanding, and pushes closer to a generative autonomy engine capable of imagining complex futures with physical accuracy. Over the last five years, Wayve has moved from narrow driving policies to some of the most complete multi-camera world models in the industry.
Five years ago, NVIDIA’s autonomy platform leaned heavily on traditional simulation. Omniverse provided high-fidelity digital twins, but most of the behavior inside those environments was hand-authored. NPC vehicles followed pre-written logic. Lighting and physics were rule-based. The system was powerful but limited by the usual sim-to-real gap.
Cosmos changes that picture entirely. Instead of crafting environments manually, Cosmos learns the rules of the physical world from video. It supports multiview generation, meaning it can recreate full 360° driving scenes that reflect how real autonomous vehicles perceive the road. This is a pivotal capability because modern AV platforms depend on complete surround-view understanding.
Cosmos ties directly into NVIDIA’s training and simulation suite. Real multi-camera footage can be used to generate countless scene variations, allowing end-to-end policies to be trained and tested in rich, realistic worlds. With Cosmos, NVIDIA has effectively transitioned from simulation to generative world modeling.
Waymo’s contributions to world models have been quieter but equally important. Historically, Waymo operated one of the most advanced manually constructed simulation platforms in the industry. Their CarCraft simulator was a benchmark for scenario testing, long before most AV companies had workable digital twins.
In recent years, Waymo has begun integrating learned generative components into this simulation stack. Their research emphasizes motion prediction, behavior forecasting, and the reconstruction of complex urban interactions. While much of their work remains internal, Waymo's multi-sensor datasets and their long history of structured simulation make them a significant, if understated, force in world modeling.
DeepMind has been building world models longer than anyone else in the space, even if much of the work was historically tucked inside reinforcement learning research. The release of Genie marked their most public demonstration of a general-purpose world model. Genie takes a single image as input and generates an interactive, controllable environment. It learns physics, dynamics, and scene evolution without hand-coded rules, and it can generalize far beyond driving.
Genie is not an automotive model, and it is not trained on multi-camera footage. Think of it as proof that world models can understand and simulate open environments with minimal supervision. It shows how far generative physical modeling can go when trained at scale. While Genie does not compete directly with models like GAIA or Cosmos, it demonstrates the direction many researchers expect world models to move toward: general-purpose simulators that can be adapted to robotics, gaming, and eventually mobility.
Meta has published several generative and predictive models relevant to world modeling, including video generation systems, large-scale multimodal architectures, and research into agents that reason over space and time. Some of their work focuses on embodied AI and robotics, and some focuses on large video datasets. While not directly commercialized for AV development, Meta’s models often serve as architectural blueprints for world model research in the autonomy sector.
Tesla is the longest-running practitioner of end-to-end driving and one of the earliest teams to push toward a latent world model architecture at scale. Its models ingest footage from a suite of eight cameras arranged around the vehicle, though much of the emphasis today remains on the forward driving view. Tesla uses world model principles primarily for planning and closed-loop control, where the model predicts how the environment will evolve and guides the driving policy accordingly.
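One way to picture how a predictive model can guide a driving policy is planning by imagination: sample candidate action sequences, roll each one forward inside the world model, and keep the one that scores best. The sketch below illustrates that generic pattern; it is not Tesla's actual planner, and all names are placeholders.

```python
import numpy as np


def plan_by_imagination(world_model, score_fn, latent, candidate_plans, horizon=30):
    """Choose the candidate action sequence whose imagined future scores best.

    Hypothetical interface, for illustration only:
      world_model.step(latent, action) -> next latent state
      score_fn(rollout)                -> higher is better (progress, safety, comfort)
    """
    best_score, best_plan = -np.inf, None
    for actions in candidate_plans:                   # e.g. sampled steering/speed profiles
        state, rollout = latent, []
        for action in actions[:horizon]:
            state = world_model.step(state, action)   # imagine one step ahead
            rollout.append(state)
        score = score_fn(rollout)
        if score > best_score:
            best_score, best_plan = score, actions
    return best_plan
```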
Tesla has also publicly stated that it built a neural network world simulator that generates fully synthetic driving worlds for its autonomy stack. According to Tesla, this simulator can introduce new challenges like a pedestrian appearing or a car cutting in, replay older edge cases, and even be explored interactively like a video game, extending the same approach to Optimus.
Where Tesla differs from others is in scale. Their fleet provides billions of miles of data, giving them a continuous feedback loop between real-world behavior and their learned model of the environment.
Valeo was one of the first major automotive suppliers to open-source a substantial world model. VaViM predicts future driving frames, and VaVAM converts those visual predictions into driving actions. The most striking aspect of VaVAM is the emergence of defensive driving behavior, even though it was not explicitly trained to avoid collisions. It learned those behaviors naturally from the data.
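Conceptually, the pair forms a two-stage pipeline: a video model imagines what happens next, and an action model turns that prediction into a command. The placeholder sketch below mirrors that split; it is not the published VaViM or VaVAM interface.

```python
def drive_one_step(video_model, action_model, past_frames):
    """Two-stage world-model driving step with illustrative placeholders.

    video_model.predict(past_frames)            -> imagined future frames (VaViM-style role)
    action_model.act(past_frames, predictions)  -> steering / speed command (VaVAM-style role)
    """
    predictions = video_model.predict(past_frames)         # imagine the near future
    command = action_model.act(past_frames, predictions)   # act on that prediction
    return command
```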
Valeo’s models are primarily front-facing and focus on lane following, interactions with leading vehicles, and prediction under varying traffic conditions. Their decision to publish these models openly has helped accelerate academic and commercial research.
Comma began with a front-facing imitation learning approach similar to the early work done by Wayve. Their world model work focuses on predicting future frames and actions based on dashcam footage. Comma’s dataset is large and diverse, and their models specialize in predicting human-like driving behavior in regular conditions.
Over the last few years, Comma has evolved from simple steering prediction to more comprehensive video prediction models that help refine driving policies, identify edge cases, and improve scenario coverage. While not 360°, their front-facing approach captures a surprising amount of road context.

The next generation of world models will be multimodal and multiview, built from synchronized 360° camera arrays. Surround-view models are becoming essential because modern AVs do not drive with one camera. They perceive the world in all directions, and their training data must reflect that.
We will also see larger, more capable latent spaces. Models will learn physics implicitly. They will understand long-horizon interactions, not just the next frame. They will generate full cities, not just single clips. And they will integrate seamlessly into end-to-end driving pipelines.
This shift opens an interesting angle for NATIX.
World models need real multi-camera data, not just synthetic variants of front-facing clips. The NATIX multi-camera dataset, with complete 360° driving coverage across continents, weather, traffic cultures, and jurisdictions, gives world models the grounding they need to learn physical reality accurately. As the field moves toward surround-view world modeling, NATIX becomes a natural foundation layer for training and validating these systems.
World Models are becoming the backbone of next-generation autonomy. They offer a way to learn realistic physics, predict scene evolution, and generate massive amounts of training data that no real-world fleet can collect quickly enough. From Wayve and NVIDIA to Valeo and Tesla, every leading team is converging on the same principle: the future of autonomy will be trained in learned simulations, grounded by real multi-camera data.
As the field shifts from handcrafted modules to data-driven, generative end-to-end systems, world models will define the next decade of autonomous driving and robotics. And the companies that combine large-scale real data with powerful generative models will shape the future of Physical AI.
Taken together, this is a high-level but technically grounded picture of the companies shaping the field, and of how world models are evolving into one of the most important technologies in Physical AI.