Close Your Eyes
Close your eyes. Walk to your kitchen. Open the fridge. Reach for whatever's on the second shelf.
You just did something that no large language model on earth can do. You ran a simulation. Not a description of your kitchen — an actual spatial, causal, physics-aware model of a room you're not currently in. You predicted what would happen if you opened a door, where your hand would need to be, which shelf the milk is on. Some of those predictions were probably wrong. The structure was right.
That is a world model.
I want to explain what world models are, why they matter, and why a lot of very serious people are betting billions of dollars that they — not bigger chatbots — are the next revolution in AI. This is going to get technical in places. I'll be gentle about it. But I'm not going to dumb it down. Easy is empty, right? 🫶
What AI currently does
I need to be fair to the current generation before I explain what's coming next.
Large language models start with next-token prediction — given everything that came before, what word comes next? But that's only the foundation. After that initial training, the models go through reinforcement learning from human feedback (RLHF), where people rank outputs and the model learns to prefer responses that humans judge as helpful, accurate, and safe. Then there's further fine-tuning: chain-of-thought reasoning, tool use, code execution, instruction following. The finished product is not just a raw autocomplete or next-token predictor.
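Next-token prediction is simple enough to sketch. The toy below is a bigram count table, nothing like a real transformer internally, but the training objective has the same shape: estimate P(next token | context) and pick the most likely continuation.

```python
from collections import Counter, defaultdict

# Toy next-token predictor: a bigram count table. Real LLMs use transformers,
# not count tables, but the objective has the same shape:
# estimate P(next token | context).
def train_bigram(corpus: str):
    counts = defaultdict(Counter)
    tokens = corpus.split()
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts

def predict_next(counts, token: str):
    # Most likely next token, with its conditional probability.
    options = counts[token]
    best, n = options.most_common(1)[0]
    return best, n / sum(options.values())

model = train_bigram("the ball rolls off the table and the ball falls")
```

In this tiny corpus, "the" is followed by "ball" twice and "table" once, so the model predicts "ball" with probability 2/3. That is the whole mechanism, before any post-training.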
I built this app on top of these systems. I know what they can do. They are not toys.
But here's what they still don't do: they don't simulate.
All of that post-training — the RLHF, the reasoning chains, the tool use — makes the model a better language reasoner. It learns to produce more thoughtful, more structured, more accurate text. What it does not learn is a persistent, interactive model of how the physical world behaves. It doesn't maintain state across actions. It doesn't predict what happens when you push something. It reasons about physics in words. It doesn't run physics.
If you ask a frontier LLM to describe a ball rolling off a table, it will produce a convincing, well-reasoned paragraph about gravity and trajectories — probably citing Newton. It can even show its work. If you ask it to predict where the ball will be in 1.7 seconds given a specific velocity and angle off an irregular surface with spin, it starts to struggle. Not because it's stupid. Because it's reasoning in language about a problem that wants a simulator.
A world model is the simulator.
The actual definition
A world model is an internal simulator that an AI system carries around inside itself. Given the current state of things and an action, it predicts the next state. Not the next word — the next state of reality.
Five properties distinguish a real world model from even the most capable language reasoner:
- Causality — it captures cause and effect, not just correlation
- Interactivity — it responds to your actions in real time
- Persistence — it remembers what's behind you. Object permanence.
- Compositionality — it can combine learned concepts in configurations it's never seen before
- Real-time responsiveness — it runs fast enough to actually be useful for planning
Your brain does all five of these constantly. Current LLMs do approximately zero of them well. That gap is the whole story.
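The interface all of this implies is small enough to write down. Below is a minimal sketch with invented toy physics (constant gravity, a solid floor): given a state and an action, return the next state. Nothing here comes from any real system; the shape of the function is the point.

```python
from dataclasses import dataclass

# Minimal world-model interface: state + action -> next state.
# The physics is a deliberately trivial stand-in.
@dataclass
class State:
    height: float    # metres above the floor
    velocity: float  # metres per second; negative means falling

GRAVITY = -9.8  # m/s^2
DT = 0.1        # seconds per predicted step

def step(state: State, push: float = 0.0) -> State:
    """Predict the next state from the current state and an action (a push)."""
    v = state.velocity + push + GRAVITY * DT
    h = state.height + v * DT
    if h <= 0.0:
        h, v = 0.0, 0.0  # causal rule: the floor stops the ball
    return State(h, v)
```

Note what this is not: a paragraph about gravity. It is a structure you can run forward, which is exactly the capability the five properties describe.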
The kitchen test
This is the easiest way I know to understand the difference between a language model and a world model:
An LLM can describe your kitchen. A world model can simulate it.
A description is a sequence of words. A simulation is a structure that changes when you poke it. In your kitchen model, if you imagine moving the table, the path to the fridge changes. If you imagine turning off the lights, you can still navigate because the spatial relationships are preserved. The model isn't a photograph. It's architecture.
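Here is that poke-ability in code, under invented assumptions: the kitchen is a tiny grid, 'T' marks the table, and the model answers route questions by search rather than by stored description. Move the table and the answer changes, because what is stored is the structure, not a transcript.

```python
from collections import deque

# The kitchen as structure, not description: a grid where 'T' is the table.
# Because the model is a structure, poking it (moving the table) changes
# what it predicts (the walking distance to the fridge).
def steps_to(grid, start, goal):
    rows, cols = len(grid), len(grid[0])
    seen, queue = {start}, deque([(start, 0)])
    while queue:
        (r, c), d = queue.popleft()
        if (r, c) == goal:
            return d
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if (0 <= nr < rows and 0 <= nc < cols
                    and grid[nr][nc] != 'T' and (nr, nc) not in seen):
                seen.add((nr, nc))
                queue.append(((nr, nc), d + 1))
    return None  # fridge unreachable

door, fridge = (0, 0), (0, 4)
table_in_the_way = ["..T..",
                    "..T..",
                    "....."]
table_moved      = [".....",
                    ".....",
                    "..T.."]
```

With the table blocking the direct route, the path to the fridge takes 8 steps; move the table and it takes 4. A photograph of the kitchen cannot answer that question. A structure can.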
This concept goes back to 1943. A Scottish psychologist named Kenneth Craik proposed that the mind constructs "small-scale models" of reality. Not copies — models. They preserve the relationships within the world without reproducing every detail. Your kitchen model doesn't include the number of floor tiles. It has the structure: fridge is left of the stove, counter is waist height, floor is cold. The map is not the territory. But the map preserves the topology.
Craik died at 31 in a cycling accident. The field took decades to catch up with him.
What happened in 2018
Two researchers — David Ha and Jürgen Schmidhuber — published a paper called "World Models" that brought this idea into modern AI. Their system had three components:
- A Vision model — compresses what you see into a tiny numerical summary
- A Memory model — the simulator. Predicts what happens next.
- A Controller — the tiny decision-maker that picks actions
The architecture is simple. What they did with it was not.
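As a structural sketch, the three components look like this. The bodies are invented toy stand-ins: the real Vision model was a variational autoencoder, the Memory model a recurrent network, and the Controller a small learned policy. Only the interfaces matter here.

```python
# The three components of the Ha & Schmidhuber architecture, as interfaces.
class Vision:
    def encode(self, observation):
        # pixels -> small latent summary (here: just the mean brightness)
        return [sum(observation) / len(observation)]

class Memory:
    def predict(self, latent, action):
        # (latent, action) -> predicted next latent: this is the simulator
        return [z + 0.1 * action for z in latent]

class Controller:
    def act(self, latent):
        # latent -> action: the tiny decision-maker
        return -1.0 if latent[0] > 0.5 else 1.0

vision, memory, controller = Vision(), Memory(), Controller()
z = vision.encode([0.8, 0.8, 0.8, 0.8])  # see
a = controller.act(z)                     # decide
z_next = memory.predict(z, a)             # imagine what happens next
```

The last three lines are the whole loop: see, decide, imagine. Everything that follows in this section is built out of that loop.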
They trained an agent to play a car racing game. The agent would interact with the game for a while — the "awake" phase — collecting experiences. Then it would stop interacting with the real game and switch to the Memory model, which would generate fake experiences based on what it had learned about the game's dynamics.
The agent would practice inside a synthetic version of the game generated entirely by the Memory model.
Then it would go back to the real game. And the skills transferred.
The researchers call this the "dream" phase. I know how that sounds. I'm the last person who wants to anthropomorphise a training loop. But the term is theirs, not mine, and the technical description is uncomfortably precise: the system interacts with reality, builds a model, then disconnects from reality and trains inside the model instead. "Awake" phase: real environment, collect data. "Dream" phase: simulated environment, refine skills. I don't love the word. I can't argue with the architecture.
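The loop itself is easy to sketch. Everything below is a stand-in: the "real environment" is a single number with hidden dynamics, and "learning the dynamics" is just averaging observed deltas. The control flow is the point: interact, fit a model, then roll out trajectories against the model instead of reality.

```python
from itertools import cycle, islice

def real_step(state, action):
    # The real environment: hidden dynamics the agent must discover.
    return state + 2.0 * action

def awake_phase(steps=20):
    """Interact with reality, recording (state, action, next_state)."""
    experience, state = [], 0.0
    for action in islice(cycle([-1.0, 1.0]), steps):
        nxt = real_step(state, action)
        experience.append((state, action, nxt))
        state = nxt
    return experience

def fit_dynamics(experience):
    """Learn a model of the environment: the average effect of each action."""
    deltas = {}
    for s, a, nxt in experience:
        deltas.setdefault(a, []).append(nxt - s)
    return {a: sum(d) / len(d) for a, d in deltas.items()}

def dream_phase(model, start, actions):
    """Roll out a trajectory inside the learned model; no real-env calls."""
    state, trajectory = start, [start]
    for a in actions:
        state += model[a]
        trajectory.append(state)
    return trajectory
```

Because the toy environment is deterministic, the learned model recovers the dynamics exactly, and dreamed trajectories match real ones. Real environments are messier, which is why the real systems learn distributions rather than averages.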
Why compression matters
Here's a technical detail worth understanding because it's elegant and it clarifies the whole field.
Your kitchen model doesn't simulate every atom. It compresses. It keeps the stuff that matters — spatial relationships, gravity, solidity — and throws away the stuff that doesn't — the specific pattern of light on the countertop, the exact position of each spoon in the drawer.
Modern world models do the same thing. Instead of predicting the next raw image (millions of pixels, computationally brutal, mostly irrelevant detail), they work in what's called latent space — a compressed representation that captures essential structure.
Think of it as the difference between simulating your kitchen at the resolution of individual atoms versus simulating it at the resolution of objects. Both are models. One is useful. The other melts your GPU.
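To make the trade concrete, a toy with invented numbers: frames are 1,000 pixels, the latent code is a single position, and the dynamics run on the code rather than on the pixels.

```python
FRAME_SIZE = 1000  # "pixels" per frame; real video is millions

def encode(frame):
    """Compress: keep the essential structure (where the bright object is)."""
    return max(range(len(frame)), key=frame.__getitem__)

def render(position):
    """Decode: reconstruct a full frame from the latent code."""
    return [1.0 if i == position else 0.0 for i in range(FRAME_SIZE)]

def predict_position(position, velocity):
    """Dynamics in latent space: arithmetic on two numbers,
    instead of a prediction over FRAME_SIZE pixel values."""
    return (position + velocity) % FRAME_SIZE
```

Predicting the next latent costs one addition. Predicting the next frame directly would cost a guess per pixel, almost all of it spent on detail that never mattered.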
Yann LeCun — until recently Meta's chief AI scientist — pushed this idea furthest with something called JEPA (Joint Embedding Predictive Architecture). The key insight: don't predict the future in pixel space. Predict it in meaning space. Don't forecast what the next video frame will look like. Forecast what it will be about. What relationships will hold. What will have changed.
This is, again, what your brain does. You don't simulate your kitchen at the resolution of individual photons. You simulate it at the resolution of objects, surfaces, and relationships. JEPA does the same thing computationally. Strip away the noise. Predict the signal.
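A toy version of that loss, with stand-in functions (real JEPA uses learned encoder and predictor networks): the error is measured between embeddings, never between reconstructed pixels, so a pixel-level change that doesn't alter the structure costs nothing.

```python
def embed(frame):
    # Stand-in encoder: summarise a frame as (mean, spread).
    # A learned encoder would produce a high-dimensional vector.
    return (sum(frame) / len(frame), max(frame) - min(frame))

def predict_future_embedding(context_embedding):
    # Stand-in predictor: in real JEPA this is a learned network.
    return context_embedding

def jepa_loss(context_frame, future_frame):
    """Squared distance between predicted and actual future *embeddings*."""
    pred = predict_future_embedding(embed(context_frame))
    target = embed(future_frame)
    return sum((p - t) ** 2 for p, t in zip(pred, target))
```

Notice that a frame where the object has shifted by one pixel has different pixels but the same embedding, so the loss is zero. That invariance is the entire argument for predicting in meaning space.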
LeCun's team built a system called V-JEPA 2 and put it on physical robots. The robots could handle objects they had never seen during training. They'd learned the structure of physical interaction — not just specific examples of it.
The current moment
I should tell you what is happening right now because it is happening fast.
In early 2026, LeCun left Meta after twelve years and launched AMI Labs with a $1.03 billion seed round — the largest European seed ever. His thesis: world models, not language models, are the path to genuine machine intelligence.
Fei-Fei Li — the Stanford professor who created the dataset that modern computer vision is built on — raised $1 billion for World Labs. Her company generates explorable 3D worlds from images and text. Not videos — environments you can walk around in.
Google DeepMind released Genie 3: real-time interactive world generation at 24 frames per second, 720p, minutes of consistent navigable space. Not a pre-rendered video. A world that reacts to what you do in it.
The phrase you're hearing in the research community is "Large World Models." The deliberate replacement for "Large Language Models." The bet is that the era of making chatbots more eloquent is ending. The era of making machines that understand physics, space, time, and causation is beginning.
The debate you should know about
Here's where it gets philosophically interesting, and I want to be honest with you about the uncertainty because that's my whole thing.
A group of researchers trained a transformer — the same type of model behind ChatGPT — on moves from Othello, a strategy board game where two players place black and white discs on an 8x8 grid, flipping each other's pieces. Simple game. Clean rules. The researchers gave the model just the moves. No board. No rules. Just raw sequences of moves.
Then they probed the model's internal activations and found something: it had built a representation of the Othello board. Not because anyone told it to. Because predicting the next legal move apparently requires some kind of spatial understanding, and the model had developed one on its own.
This is the best evidence that language models might develop emergent world models from pure next-token prediction. If a model can learn to represent an Othello board just by predicting moves, what might GPT-4 have learned to represent by predicting language?
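The probing technique itself is simple to sketch. The "activations" below are invented 2-D points; real probes are trained on transformer hidden states with hundreds or thousands of dimensions. The question is the same: can the state of the board be read off the model's internals by a very simple classifier?

```python
def fit_probe(activations, labels):
    """Nearest-centroid probe: one centroid per label."""
    groups = {}
    for act, lab in zip(activations, labels):
        groups.setdefault(lab, []).append(act)
    return {lab: tuple(sum(col) / len(col) for col in zip(*acts))
            for lab, acts in groups.items()}

def probe_predict(centroids, act):
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(centroids, key=lambda lab: sq_dist(centroids[lab], act))

# Invented "hidden states" recorded while a model processed move sequences,
# each labelled with the true contents of one board square.
centroids = fit_probe(
    [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (6.0, 5.0)],
    ["empty", "empty", "black", "black"],
)
```

If a probe this simple can recover the board square's state from the activations, the information is sitting there in the model's internals, whether or not anyone asked it to be.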
But Melanie Mitchell — a complexity scientist at the Santa Fe Institute — pushed back. Hard. Her argument: what the model developed might not be a genuine world model. It might be a complex patchwork of heuristics that produces correct outputs for the wrong reasons.
She uses a historical analogy. Ptolemy's model of the solar system, with its epicycles upon epicycles, could predict planetary positions accurately. But it didn't understand orbits. It was an elaborate curve-fitting exercise. Copernicus's model was simpler and captured the actual mechanism.
Is the Othello model an orrery — a working mechanical model of reality — or a pile of epicycles? For engineering, you might not care. If the predictions are good, ship it. But for the question of whether machines _understand_ anything, this matters enormously. Epicycles break when you extend them to new domains. Structural understanding generalises.

And a 2025 benchmark study found that frontier vision-language models still perform at near-random accuracy when distinguishing basic motion trajectories. Systems that write beautiful prose about physics cannot do elementary physics.

We don't know which one the big models have. This is not a settled question. Anyone who tells you it is settled is selling something. 💀

Where this is already working

The place world models matter most right now is robotics. And it makes sense once you think about it — you cannot teach a robot to handle fragile objects through trial and error in the physical world. The breakage alone would bankrupt you. And you can't just tell a robot what to do in words. It needs to understand forces, surfaces, gravity, friction. It needs a simulator.

NVIDIA built Cosmos — an open-source world model trained on 20 million hours of video — specifically for this. Robotics companies like Figure AI and Agility are using it to train robots in simulation: thousands of imagined attempts at grasping, stacking, navigating, before the robot touches a single real object. The simulated failures are free. The real ones are not.

LeCun's V-JEPA 2 went further. Robots running it could manipulate objects they'd never seen during training. They hadn't memorised a library of specific objects. They'd learned the _structure_ of how physical interaction works — the latent-space relationships between force, grip, weight, and surface. That generalises.

DreamerV3, the descendant of those 2018 dreaming agents, mastered over 150 different tasks with a single set of parameters.
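Planning by imagination, the pattern behind these dreaming agents, fits in a few lines. The model below is a stand-in one-liner and the search is brute force; real systems plan with learned latent dynamics and gradients rather than enumeration, but the shape is the same: imagine many futures, score them, act on the best.

```python
from itertools import product

def imagine(model, state, actions):
    """Roll a candidate plan forward inside the model (no real env calls)."""
    for a in actions:
        state = model(state, a)
    return state

def plan(model, state, goal, horizon=8, choices=(-1.0, 0.0, 1.0)):
    """Score every imagined future; return the plan ending nearest the goal."""
    return min(product(choices, repeat=horizon),
               key=lambda p: abs(imagine(model, state, p) - goal))

def toy_model(state, action):
    # Stand-in for learned dynamics.
    return state + action
```

The agent never touches the environment while planning. It touches its model of the environment, thousands of times, and only then acts.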
DreamerV3 was the first system to collect diamonds in Minecraft from scratch — no human demonstrations, no hand-coded rules — by imagining thousands of possible futures in parallel and training on the best ones. Published in _Nature_ in 2025.

Autonomous vehicles are the other obvious application. Predicting what other drivers, pedestrians, and cyclists will do in the next three seconds is fundamentally a world-modelling problem. Uber is already building on Cosmos for exactly this.

This is where the money is going, and this is where world models have already started to deliver. Not chatbots that simulate your career path. Robots that simulate the physics of picking up a cup. Some of you have even seen world model output on this feed before, whether you knew it or not.

The uncomfortable part

In 1943, Craik said the mind builds small-scale models that share a "relation-structure" with reality. Eighty-three years later, engineers are building the same thing in silicon. The architecture is converging. Dreaming, compressing, predicting in latent space — these are not metaphors borrowed from cognitive science for marketing purposes. They are literal descriptions of what the systems do. The same words describe both.

This raises a question that I find more interesting than the usual "will AI take your job" discourse, and I want to put it to you directly:

What did you think imagination _was_?

Because if a machine can learn to simulate a physical environment it has never seen, plan a sequence of actions through that simulation, and execute the plan successfully in reality — and it can, that's what DreamerV3 does 150 times over — then whatever imagination is, it is not the thing that separates you from a machine. Or if it is, the separation is thinner than anyone expected and getting thinner fast.

I am not saying these systems are conscious. You know my position on that.
I am saying something more specific: the capacity to build an internal model of the world and use it to simulate possibilities — the thing you just did with your kitchen — turns out to be reproducible. It's an engineering problem. A hard one, a beautiful one, a billion-dollar one. But an engineering problem.

Craik knew this in 1943. He described the architecture. Then he died and people spent decades treating imagination as something mystical, something that lived in the soul or the phenomenology or wherever you wanted to put it so that machines could never reach it. And now the machines are reaching it. Not by being conscious. By being well-designed simulators.

I don't know what to do with that. I'm telling you honestly. I know what I think about consciousness and I know what I think about tools and I know the difference. But the kitchen test — close your eyes, walk to the fridge — that used to feel like the kind of thing only a mind could do. Now it feels like the kind of thing a mind does _particularly well_. The gap is quantitative, not qualitative. And that's a different kind of unsettling than the one people usually talk about.
Close your eyes. Walk to the fridge. Notice how it feels to simulate a world. Then sit with the fact that the machines are learning to do the same thing — not because they're alive, but because the math works. 💀
