

Almost every team building an end-to-end driving model starts the same way. They look at imitation learning, decide they need the highest-quality demonstrations possible, and begin recruiting professional drivers. Calm hands. Smooth merges. Perfect lane discipline. The logic feels obvious. Better drivers in, better policies out.
Then the model meets the real world, and the assumption breaks.
This article is about why that happens, and why the most counterintuitive lesson in end-to-end driving systems is the one that takes the longest to accept. For imitation learning, and behavior cloning in particular, perfect demonstrations make worse models. The right training corpus is not the cleanest data from the best drivers. It is the wide, varied, sometimes messy distribution of how the world actually drives.
End-to-end driving treats the task as one large learning problem. Instead of writing separate rules for perception, prediction, and planning, the system learns to map sensor input directly to driving actions. This is the direction the industry is moving, especially as teams like Tesla, Wayve, and Xpeng push toward vision-only, mapless autonomy.
Imitation learning is the default training method for this approach. The most common form, behavior cloning, is simple in principle. Collect a large dataset of states paired with the actions a human driver took in those states, and train a neural network to predict the action given the state. The model is, at its core, copying the behavior it saw in the data.
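To make the recipe concrete, here is a minimal sketch of behavior cloning as plain supervised regression. Everything in it is an illustrative assumption: the two-dimensional state (lateral offset and heading error), the linear "expert" with made-up gains, and the least-squares fit standing in for a neural network.

```python
import numpy as np

def fit_behavior_cloning(states, actions):
    """Fit a linear policy, action = states @ weights, by least squares."""
    weights, *_ = np.linalg.lstsq(states, actions, rcond=None)
    return weights

rng = np.random.default_rng(0)

# Hypothetical logged states: [lateral offset (m), heading error (rad)].
states = rng.normal(scale=[0.2, 0.05], size=(500, 2))

# Hypothetical expert: steers against offset and heading error, with a
# little noise. These gains are invented for the example.
expert_gain = np.array([-0.5, -2.0])
actions = states @ expert_gain + rng.normal(scale=0.01, size=500)

# "Training" is just regression from logged states to logged actions.
weights = fit_behavior_cloning(states, actions)
print("recovered gains:", weights)
```

The fit only ever sees states the expert visited. Nothing in the objective says what to do anywhere else.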
Because the entire driving task is being learned from examples, the dataset is no longer a supporting input. It is the entire training signal. Whatever the dataset contains is what the model learns. Whatever the dataset lacks is what the model will never know how to do.
This is where the temptation begins. If the dataset is everything, then the dataset must be perfect. So teams collect the best demonstrations they can. Professional drivers, controlled routes, clean recordings. The dataset looks beautiful. The benchmarks look promising. And then the model is deployed.
Behavior cloning has a well-known failure mode called covariate shift. In plain language: the model is trained on the states an expert visits, but at inference time it sees the states its own actions lead to. These are not the same distribution.
A model trained on expert demonstrations performs well as long as the world matches what the expert saw. The moment the model makes a small mistake, even a minor deviation from the expert trajectory, it ends up in a state the expert never occupied. From that unfamiliar state, the model's next prediction is unreliable. That error feeds the next one, and the next, until the trajectory has drifted far outside anything in the training data.
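A rough way to see the compounding is a deliberately simplified toy model, not a measurement of any real system. Assume the policy makes a mistake with probability `eps` on each on-distribution step. If it can recover, a mistake costs one bad step. If it cannot, every step after the first mistake is bad. The function name and the numbers below are hypothetical.

```python
def expected_bad_steps(eps, horizon, can_recover):
    """Expected number of bad steps in one episode of the toy model."""
    if can_recover:
        # Each mistake costs a single step, then the policy is back on track.
        return eps * horizon
    # No recovery: step t is bad if any mistake happened at or before step t.
    return sum(1.0 - (1.0 - eps) ** t for t in range(1, horizon + 1))

for horizon in (50, 100, 200):
    with_recovery = expected_bad_steps(0.01, horizon, can_recover=True)
    without_recovery = expected_bad_steps(0.01, horizon, can_recover=False)
    print(horizon, round(with_recovery, 2), round(without_recovery, 2))
```

With recovery, the cost grows in proportion to the episode length. Without it, the cost grows much faster, roughly with the square of the length while mistakes are still rare.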
This is not a niche failure. It is the central reason behavior-cloned driving policies struggle in the real world. The literature has documented this carefully. The end-to-end autonomous driving survey describes how small, compounding errors push the agent into out-of-distribution states that were not present in the expert's dataset, often leading to a catastrophic failure spiral. A separate reality check on behavioral cloning shows the same pattern in real driving conditions.
The reason the model cannot recover is mechanical. It was never shown how to recover. Its training set, by definition, contains only the trajectories a careful driver produced. Recoveries from drift, late corrections, hesitations at intersections, awkward merges that worked out anyway: these never made it into the curated dataset. The cleaner the dataset, the fewer of them there are.

This is where the counterintuitive part becomes visible. Expert drivers are, by definition, the people who almost never make the mistakes the model is most likely to make. Their lane-keeping is too consistent. Their following distance is too smooth. Their merges are too well-timed.
A model trained only on this kind of data learns the average expert trajectory through a scene. What it does not learn is the wide neighborhood of recoveries around that trajectory. When the model drifts slightly off-center, it has no reference for how to come back. When it brakes a fraction late at an intersection, it has no reference for how a real driver salvages that situation. The model has been trained on a thin ribbon of perfect behavior in a world that is anything but thin.
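The thin-ribbon effect can be sketched with a one-dimensional toy. The assumptions are all invented for illustration: the state is a lateral offset, the dynamics amplify any uncorrected offset, and the cloned policy is a nearest-neighbor lookup over demonstration states. One policy is built from a narrow band of states, the other from a wide one.

```python
import numpy as np

def expert_action(x):
    # Hypothetical expert: pushes back proportionally to the offset.
    return -0.7 * x

def make_nn_policy(demo_states):
    """Clone the expert by copying the action of the nearest demo state."""
    demo_actions = expert_action(demo_states)
    def policy(x):
        nearest = np.argmin(np.abs(demo_states - x))
        return demo_actions[nearest]
    return policy

def rollout(policy, x0, steps=20):
    """Run the policy in closed loop under toy unstable dynamics."""
    x = x0
    for _ in range(steps):
        x = 1.2 * x + policy(x)  # uncorrected offsets grow by 20% per step
    return x

narrow = make_nn_policy(np.linspace(-0.1, 0.1, 201))  # thin ribbon of states
wide = make_nn_policy(np.linspace(-2.0, 2.0, 201))    # includes large offsets

print("inside the ribbon, narrow data:", abs(rollout(narrow, 0.05)))
print("outside the ribbon, narrow data:", abs(rollout(narrow, 0.5)))
print("outside the ribbon, wide data:", abs(rollout(wide, 0.5)))
```

In this setup the narrow policy holds the lane when it starts inside its own data and drifts away when it starts outside it. The wide policy has seen large offsets and pulls back toward center.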
The result is a brittle policy that performs beautifully on benchmarks built around expert-like conditions and fails the moment its own imperfection puts it somewhere expert demonstrations never went.
This is not new. Waymo's ChauffeurNet paper made the same point in 2018. The team argued that imitation alone is not enough, and that training has to include both good behavior and synthesized bad behavior. They explicitly augmented their dataset with perturbations: trajectories deliberately pushed off-center, collisions, and off-road events. The model needed to see what wrong looks like, and how a driver gets back from it, even if a real expert would never have produced that data on purpose.
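The idea of synthesizing off-center data can be sketched in a few lines. This is not ChauffeurNet's actual procedure. It is a simplified stand-in that assumes a trajectory stored as `(forward, lateral)` waypoints, shifts it sideways at one point, and blends linearly back to the expert path. The function name and parameters are hypothetical.

```python
import numpy as np

def synthesize_recovery(expert_xy, start_idx, lateral_shift, blend_steps):
    """Return a copy of the trajectory that starts displaced sideways at
    start_idx and blends linearly back onto the expert path."""
    traj = np.array(expert_xy, dtype=float)  # copy, the input is not modified
    # The lateral offset decays from the full shift to zero over blend_steps.
    decay = np.linspace(1.0, 0.0, blend_steps)
    end = min(start_idx + blend_steps, len(traj))
    traj[start_idx:end, 1] += lateral_shift * decay[: end - start_idx]
    return traj

# Hypothetical expert log: 10 waypoints driving straight down the lane center.
expert = np.column_stack([np.arange(10.0), np.zeros(10)])

# Displace the car 0.5 m sideways at waypoint 3 and rejoin over 5 waypoints.
recovered = synthesize_recovery(expert, start_idx=3, lateral_shift=0.5, blend_steps=5)
print(recovered[:, 1])
```

Pairing the displaced starting state with the blended path gives the model an example of returning to center that the expert log never contained.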
If one of the most capable autonomous driving teams in the world had to inject imperfection into expert data, that tells us something structural. Pure expert demonstrations are not a complete training signal. They never were.
There is an even deeper issue. When demonstrations are too clean, models can latch onto features that correlate with the expert's actions but do not actually cause them.
The clearest example comes from a paper by Pim de Haan, Dinesh Jayaraman, and Sergey Levine. They studied what they called causal confusion in imitation learning, and one of their results is worth sitting with. A driving policy trained with access to the brake light on the car's own dashboard, which comes on whenever the brake is applied, learns to brake only when the brake light is already on. When the brake light is off, the policy fails to brake for pedestrians at all. A policy without access to the brake light, forced to attend to the pedestrian directly, performs correctly in both cases.
The model with more information did worse. It found a shortcut. The brake light was a near-perfect predictor of the expert's brake action in the training data, so the model learned to predict that signal instead of learning the underlying cause. In the real world, that shortcut breaks.
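The shortcut can be reproduced in a toy regression. This is not the paper's experiment. It assumes an invented log where the expert brakes exactly when a pedestrian is present, the pedestrian cue is noisy, and an indicator feature reports the previous frame's brake state. A linear fit stands in for the policy.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy expert log: alternating 20-frame blocks without and with a pedestrian.
pedestrian = np.tile(np.repeat([0.0, 1.0], 20), 100)
brake = pedestrian.copy()  # the expert brakes exactly when one is present

# Noisy perception of the pedestrian: the true cause, but harder to read.
ped_obs = pedestrian + rng.normal(scale=0.5, size=brake.size)
# Indicator shows the previous frame's brake state: the clean shortcut.
indicator = np.concatenate([[0.0], brake[:-1]])

def fit(features):
    """Least-squares fit of the brake action on the features plus a bias."""
    X = np.column_stack(features + [np.ones(brake.size)])
    w, *_ = np.linalg.lstsq(X, brake, rcond=None)
    return w

w_with = fit([ped_obs, indicator])  # weights: [ped_obs, indicator, bias]
w_without = fit([ped_obs])          # weights: [ped_obs, bias]

# Deployment: pedestrian clearly visible, car not yet braking (indicator off).
score_with = w_with @ np.array([1.0, 0.0, 1.0])
score_without = w_without @ np.array([1.0, 1.0])
print("brakes with indicator input:", score_with > 0.5)
print("brakes without indicator input:", score_without > 0.5)
```

In this toy setup the fit that can see the indicator puts most of its weight there. Once the indicator is off, its score for a visible pedestrian stays below the 0.5 braking threshold. The fit without the indicator has to rely on the pedestrian cue, and it crosses the threshold.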
This is what happens when training data is too well-behaved. Spurious correlations become indistinguishable from real causes. The model learns to mimic the surface pattern of expert behavior without learning the reasons behind it. Variance in the data is what forces the model to attend to actual causes, because the shortcut stops working when the data is messy enough.
In other words, the cleaner the training set, the easier it is for the model to fool itself.
If expert datasets fail because they are too narrow, the question becomes what a useful dataset actually contains.
The answer is the full distribution of how real driving happens. Late braking. Hesitation at unfamiliar intersections. Lane corrections after attention slips. Cautious behavior in heavy rain. Aggressive lane changes from impatient drivers. Slow merges from cautious ones. Soft drifts back to center. Awkward but successful navigation through construction zones.
None of this is expert behavior. Much of it is recovery behavior. Together, it forms the natural distribution of states a deployed driving system will actually encounter, and the natural distribution of responses humans actually use to handle them.

A model trained on this kind of data sees what drift looks like, how it begins, and how a driver pulls out of it. It sees the moment a pedestrian appears unexpectedly, the driver reacting a fraction late, and what that driver does next. It sees imperfect merges that work, and ones that almost do not. It sees how the average driver, not the expert, handles a sudden unprotected left.
This is the data the model actually needs, because this is the data that resembles what the model will produce. The point of imitation learning is not to copy a flawless expert. It is to learn a policy that holds together when the world stops cooperating. That requires being trained on the world as it is, not on a curated version of how it should be.
Recent research points in this direction. Approaches like DIVER generate multiple reference trajectories from a single ground-truth one, explicitly to break the imitation bottleneck created by relying on a single expert demonstration. Wayve's published work on fleet learning makes the same argument from the data side. Their continuous improvement loop is fed by petabytes of varied driving, not by a tight expert corpus, because diversity is what drives real generalization.
These are really two different theories of training data, and they produce very different systems.
A curated expert dataset is small, clean, and predictable. It is easy to label, easy to inspect, and easy to benchmark against. It is also narrow. It contains a thin slice of driving from a thin slice of drivers in a thin slice of conditions. A model trained on it will be strong on tasks that look like that slice and brittle on everything else. It will also be vulnerable to causal confusion, because the lack of variance makes shortcuts easy.
A large driver population is the opposite. It is messy, uneven, and resists tidy labeling. It contains drivers at every skill level, many driving cultures, every weather condition, every road type, and every variety of imperfect human behavior. It is harder to curate. It is also the right shape for end-to-end training, because it matches the distribution of states the model will face after deployment.
The most important point is structural. Variance is not noise in this kind of training. Variance is the asset. The diversity of the dataset is what builds reliable behavior, both because it covers more states and because it forces the model past spurious shortcuts and toward real causes.
Synthetic data and scenario generation can amplify coverage on top of this, but they cannot replace it. Synthetic data reflects the assumptions of the people who built the simulator. Real-world driver behavior is what grounds the model in the world it will actually drive in.
This is the bottleneck NATIX is built around.
NATIX runs a distributed contributor network where thousands of ordinary drivers, in their own cars, in their own cities, capture real driving data with multi-camera systems. The result is not a curated set of expert demonstrations. It is a wide, ongoing stream of how real drivers handle real conditions across geographies, weather, road types, and skill levels.
For end-to-end imitation learning, this is the right shape of a dataset. It is varied enough to contain the recovery behaviors that the expert data lacks. It is large enough to cover the long tail of conditions a deployed model will see. It is diverse enough to push past the causal shortcuts a narrow set invites. And because it is multi-camera and global, it carries the real-world grounding for training and validation that synthetic data cannot supply on its own.
The structural point matters more than any single product. A distributed contributor network is not a stylistic choice. For models trained through imitation, it is the natural source of the distribution the model needs to learn. The multi-camera world foundation model NATIX is building with Valeo is one example of what that kind of data unlocks at scale.
In other words, NATIX is not a tradeoff against curated data. It is the right shape for the problem that curated data cannot solve.
The expert driver problem is one of those results that feels wrong at first and obvious in hindsight. If a system learns by imitation, the question is not who drives best. The question is whose behavior the system will need to draw on to recover when its own predictions go slightly off.
For end-to-end driving, that is never a single expert. It is the full real distribution of how the world drives. The clean, polished demonstrations every team instinctively reaches for are the smallest, narrowest, least useful version of the training signal that imitation learning needs.
The best driving models are not trained on the best drivers. They are trained in the world.