Are words the best building blocks for AI?

Contemporary large language models operate on subword tokens, fragments of words derived from compression schemes like Byte Pair Encoding. For purely linguistic tasks, this representational choice has proven remarkably effective. Yet the moment we ask these systems to reason about the physical world, their performance degrades in ways that resist the usual fixes of more parameters or more data. I have come to believe that the root cause is more fundamental than most discussions acknowledge. The bottleneck is not the model architecture, nor the training corpus. It is the token itself.
Ask GPT-5.4 whether a bowling ball or a feather hits the ground first on the moon and it will nail it. That is textbook physics, written down a million times. But ask it to predict what happens when you slide a full coffee mug across a table toward a stack of books, and the answers get shaky. Which way does the coffee slosh? Does the mug stop or topple when it hits the stack? Does the stack scatter or absorb the impact? These are not exotic scenarios. A five-year-old who has knocked over a drink has better intuitions here than a frontier model, because the five-year-old's knowledge is grounded in physical experience, not in text that happens to mention mugs.
This is an old problem wearing new clothes. In 1990, Stevan Harnad called it the symbol grounding problem. A system that manipulates symbols without any connection to what those symbols refer to is not really understanding anything. It is doing sophisticated pattern matching over a formal system that floats free of the world. Think of the Chinese Room argument, but applied to the entire representational substrate. Harnad argued that symbols need to be grounded in sensorimotor interaction with the environment to carry genuine meaning. 35 years later, we have scaled the ungrounded symbol manipulation to a breathtaking degree, and the failure modes he predicted are exactly the ones we see.
Language is a great interface for human communication, but as a representation of reality it is lossy, indirect, and socially constructed. It does not natively encode physics, geometry, causality, or dynamics. LLMs get impressive world knowledge by learning statistical correlations across massive text corpora, but they are reconstructing a world model from what is essentially a flattened linguistic projection of the actual world. The model has never felt the weight of anything, never watched liquid pour, never experienced an object resist being pushed. It is working from descriptions of these things, and descriptions always leave out the parts that seemed too obvious to mention.
This is also the frame problem showing up in a modern context. You cannot enumerate all the real world factors that stay constant or change in a given situation using language alone. When you move a cup from a table, the table is still there, gravity still applies, the liquid inside still has momentum. Nobody writes these things down because they are obvious to any embodied agent. But a system reasoning purely from text has no mechanism to know what to hold constant and what to update. Biological intelligence does not try to enumerate invariants. It uses representations with strong inductive biases already aligned to how the physical world works. The invariants are baked into the format of the representation itself.
Why language makes a shaky foundation
Words are convenient and compositional, but they are a surprisingly bad substrate for general intelligence once you start poking at why. The problems stack up in layers.
The grounding gap. The meaning of chair is rooted in shared sensorimotor experience. You have sat in one, seen one, knocked one over. You know what a chair affords (sitting, standing on, wedging under a doorknob) because you have a body that has interacted with chairs. The token chair is a statistical shadow of all that experience. LLMs work with the shadow, not the thing. This is not a minor detail. It means the model's "understanding" of chair is a cloud of co-occurrence statistics, with chair appearing near sit, table, wooden, comfortable. That is enough to answer trivia. It is not enough to predict that a three-legged chair will tip if you lean left.
No native physics. Geometry, contact dynamics, forces, material properties, causal counterfactuals. None of it is natively encoded in language. The model has to reconstruct all of it from correlation. Sometimes correlation is a decent proxy. Often it is not, and the failures cluster around exactly the cases where physical reasoning matters. Novel object configurations, edge cases in manipulation, anything involving fluid dynamics or deformable objects. Ask a model to describe what happens when you inflate a balloon inside a cardboard box and you will see the cracks immediately.
Combinatorial explosion of the unstated. The true set of relevant invariants in any physical scenario is vastly larger than what language typically describes. A sentence like "she put the glass on the shelf" implies that the shelf is horizontal, rigid, and strong enough to support the glass. It implies that the glass is upright, that gravity is acting downward, that the glass is not already on the shelf. Nobody spells this out because it is obvious to an embodied reader. But a system that must infer the complete physical state from text alone faces a combinatorial explosion of implicit assumptions, and it only needs to get one wrong to produce nonsensical downstream reasoning.
What a better token would look like
If we want more capable, grounded AI systems, the fundamental unit of computation needs to evolve beyond word fragments. A richer token system would encode the structure of the world more directly.
What would that actually look like? I think it means encoding structured geometry (objects, surfaces, poses, depth), dynamics (contact, friction, forces), agency (actions, intentions, affordances), causality (interventions, counterfactuals), hierarchical time (event centered temporal structure at variable rates), and explicit uncertainty (beliefs, hypotheses, confidence). That is a lot to ask of a token, but the current alternative, hoping that statistical correlations over text will somehow reconstruct all of this, is clearly not working for anything that requires real physical reasoning.
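To make the list above concrete, here is a minimal sketch of what one such token might carry for a single object. Every field name here is hypothetical, chosen only to mirror the categories just listed, not drawn from any existing system.

```python
from dataclasses import dataclass, field

@dataclass
class WorldToken:
    """Hypothetical structured token for one object in a scene."""
    object_id: int
    pose: tuple                  # geometry: position + orientation quaternion
    velocity: tuple              # dynamics: linear velocity, m/s
    material: str                # proxy for friction, rigidity, density
    contacts: list = field(default_factory=list)     # ids of touching objects
    affordances: list = field(default_factory=list)  # agency: e.g. "graspable"
    confidence: float = 1.0      # explicit uncertainty about this estimate

mug = WorldToken(
    object_id=0,
    pose=(0.2, 0.0, 0.8, 0.0, 0.0, 0.0, 1.0),
    velocity=(0.0, 0.0, 0.0),
    material="ceramic",
    contacts=[1],                          # resting on object 1, the table
    affordances=["graspable", "pourable"],
)
```

Even this toy version makes the contrast with a subword token visible: the invariants a sentence leaves unstated (contact with the table, zero velocity) are explicit fields, not inferences.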
Some promising directions
Nobody has solved this, but a few research threads are chipping away at pieces of it in ways I find genuinely promising.
Object centric representations are probably the most mature. Work like Slot Attention (Locatello et al.) and SAVi learns to decompose scenes into discrete object slots, each carrying attributes like pose, shape, and velocity. This is a natural fit for the factored structure of physics. The world actually is made of distinct objects that interact, and if your representation mirrors that, prediction and planning get dramatically easier. The limitation is that these methods mostly work on simple synthetic scenes. Scaling them to cluttered, real world environments is an active problem.
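The core mechanic of Slot Attention is that slots compete for input features: attention is normalized across slots rather than across inputs, so each feature gets claimed by roughly one slot. A stripped-down sketch of that iteration, omitting the learned projections, GRU update, and layer norms of the real method:

```python
import numpy as np

def softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def slot_attention(inputs, num_slots=4, iters=3, seed=0):
    """Toy Slot Attention: slots iteratively compete to explain inputs."""
    rng = np.random.default_rng(seed)
    n, d = inputs.shape
    slots = rng.normal(size=(num_slots, d))
    for _ in range(iters):
        logits = inputs @ slots.T / np.sqrt(d)          # (n, num_slots)
        # Softmax over the SLOT axis: slots compete for each input feature.
        attn = softmax(logits, axis=1)
        # Each slot becomes the weighted mean of the inputs it won.
        weights = attn / (attn.sum(axis=0, keepdims=True) + 1e-8)
        slots = weights.T @ inputs                       # (num_slots, d)
    return slots
```

The competition over the slot axis is the inductive bias doing the work: it is what pushes the representation toward "a small set of distinct objects" rather than a smeared global code.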
Action effect tokens go a step further and tokenize interactions directly. Consider something like (action: pinch-grasp, result: stable lift, observed_slip: 0.02) where the token itself encodes the causal relationship between what you did and what happened. This is close to how robotics researchers think about skill primitives, and it makes the learned representation directly useful for planning. You can chain action effect tokens into a sequence and ask whether the resulting trajectory is physically plausible.
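A sketch of that chaining idea, with hypothetical state labels: a plan is plausible only if each token's precondition matches the state the previous token left behind.

```python
from dataclasses import dataclass

@dataclass
class ActionEffectToken:
    """Hypothetical token coupling an action to its observed outcome."""
    action: str
    precondition: str       # world state the action assumes
    effect: str             # world state the action produces
    observed_slip: float = 0.0

def physically_plausible(trajectory, state):
    """Chain tokens and check that each link's assumption holds."""
    for tok in trajectory:
        if tok.precondition != state:
            return False
        state = tok.effect
    return True

plan = [
    ActionEffectToken("pinch-grasp", "mug-on-table", "mug-held", observed_slip=0.02),
    ActionEffectToken("pour", "mug-held", "mug-empty"),
]
```

With this encoding, `physically_plausible(plan, "mug-on-table")` holds, while pouring before grasping fails immediately, which is exactly the kind of check a subword sequence gives you no handle on.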
Causal program tokens take a different angle entirely, representing mechanisms as small executable program fragments that can be composed. Dreamcoder-style systems (Ellis et al.) learn libraries of programmatic abstractions from examples. The appeal is that programs are inherently compositional and simulatable. You can run them forward, run them in reverse for inference, and combine them in ways that preserve their semantics. The challenge is that learning the right program primitives from raw sensory data remains extremely hard.
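A toy illustration of the composability claim, using made-up mechanism fragments rather than anything learned: each mechanism is a small executable function over a state dict, and composition is just running them in sequence.

```python
from functools import partial, reduce

def push(state, force=1.0):
    """Mechanism fragment: an impulse changes velocity by force / mass."""
    return {**state, "v": state["v"] + force / state["mass"]}

def friction(state, mu=0.3, dt=0.1):
    """Mechanism fragment: kinetic friction decays speed toward zero."""
    return {**state, "v": max(0.0, state["v"] - mu * 9.81 * dt)}

def compose(*mechanisms):
    """Programs compose: run the fragments forward in order."""
    return lambda state: reduce(lambda s, m: m(s), mechanisms, state)

# A "slide" program assembled from two independently defined mechanisms.
slide = compose(partial(push, force=5.0), friction)
```

Running `slide({"v": 0.0, "mass": 1.0})` executes the mechanism forward and yields a velocity you can inspect, simulate onward from, or compare against observation, which is the property that makes program tokens attractive.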
On the perception side, multimodal latent tokens (VQ-VAE style discrete codes shared across vision, audio, and proprioception) offer a way to create a common representational currency across modalities. And geometric tokens, implicit field descriptors like NeRF latents or scene graph nodes with differentiable hooks into rendering and simulation, let you represent 3D structure in a form that is both learnable and physically meaningful.
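The "common currency" idea reduces to a simple operation: snap each modality's continuous latent to its nearest entry in a shared codebook, so different senses can emit the same discrete token for the same underlying event. A minimal VQ-style sketch with a toy codebook:

```python
import numpy as np

def quantize(latents, codebook):
    """Map continuous latents to nearest codebook entries (VQ-VAE style)."""
    dists = ((latents[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    codes = dists.argmin(axis=1)          # discrete token ids
    return codes, codebook[codes]         # ids + quantized vectors

codebook = np.eye(4)                      # 4 codes in a 4-d latent space
vision_latent = np.array([[0.9, 0.1, 0.0, 0.0]])   # e.g. "glass shatters", seen
audio_latent = np.array([[0.8, 0.2, 0.1, 0.0]])    # the same event, heard
```

Both latents land on code 0: two modalities, one token, which is what makes downstream reasoning modality-agnostic.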
None of these are ready for prime time. Most are barely past proof of concept. But they all point in a direction I find compelling. The token should reflect the structure of the world, not just the structure of the language we use to talk about it.
Connecting language to the world
Even if we build richer tokens, language is not going away. Humans communicate in it, goals get specified in it, explanations need to come back in it. So the architectural question becomes how to bridge word tokens and these richer representations without either side dominating.
Cross-attention is the more established path. Systems like Flamingo and PaLI use it to let a language model query information from other modalities, including image patches, audio signals, and whatever else you plug in. The language stream can attend to specific regions of a visual representation, effectively pointing at the perceptual grounding it needs. This works well when language needs to reference perception ("the red block on the left"), but it still treats the language model as the backbone. Perception serves language, not the other way around.
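Stripped of learned projections and multiple heads, the querying mechanism is compact: each text token scores every image patch and pulls back a weighted summary of the visual features it matched.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def cross_attention(text_queries, image_patches):
    """Single-head sketch: text queries attend over image keys/values."""
    d = text_queries.shape[-1]
    scores = text_queries @ image_patches.T / np.sqrt(d)  # (n_text, n_patches)
    return softmax(scores) @ image_patches                # grounded summaries
```

The asymmetry the paragraph describes is visible in the signature: text supplies the queries and receives the output, while the image only ever serves as keys and values.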
Joint-embedding methods flip that relationship in an interesting way. LeCun's V-JEPA learns by predicting the representations of masked regions of video in a shared latent space, rather than reconstructing raw pixels. By skipping pixel level prediction, the model gets pushed toward learning exactly the kind of higher level structure we want, including object persistence, motion dynamics, and physical context. The representations that emerge are not captioned experience. They are a learned compression of visual reality that captures the invariances text leaves out. This is what gives you perceptual representations actually worth bridging to.
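The distinction from pixel reconstruction is easiest to see in the loss. A heavily simplified sketch, with stand-in latents: the prediction target and the error both live in latent space, so the model is never penalized for failing to render texture it does not need.

```python
import numpy as np

def jepa_loss(context_latent, target_latent, predictor):
    """Predict the masked region's latent from the visible context.
    No pixels anywhere: the objective is defined entirely in latent space."""
    return float(((predictor(context_latent) - target_latent) ** 2).mean())

rng = np.random.default_rng(0)
context = rng.normal(size=16)   # latent of the visible part of the video
target = rng.normal(size=16)    # latent of the masked-out region
```

With an identity predictor and a target equal to the context, the loss is exactly zero; with a genuinely different target it is not, and all of the gradient pressure goes into predicting structure rather than appearance.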
I think you need both. Cross-attention gives language a way to query the world. Joint-embedding methods give the world a representation worth querying.
What this changes about learning
Adopting structured tokens shifts the learning objective from predict the next subword to something more like predict the next multimodal latent state or predict the causal consequences of an action. That sounds like a small change in framing. In practice it reshapes the entire learning dynamic.
Credit assignment gets far sharper. When a robot drops an object, a system built on structured tokens can trace the failure to a specific parameter. Perhaps grip orientation was 15 degrees off, or contact force was insufficient. A language token system can only propagate error back through a vaguely correlated word somewhere in a long prompt. The structured representation makes the error surface navigable instead of opaque.
You get real compositionality. This is the big one. If tokens align with objects and mechanisms, they can be recombined in physically lawful ways. You can take a "grasp" token learned on mugs and compose it with a "pour" token learned on pitchers and get a reasonable plan for pouring from a mug, even if that exact combination never appeared in training. Language tokens compose syntactically, which is powerful, but structured world tokens compose physically, which is what you need for generalization to novel tasks.
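The mug-and-pitcher example can be sketched directly, with entirely hypothetical skill functions: because each skill is conditioned on affordances rather than on the objects it was learned from, the novel combination composes lawfully.

```python
def grasp(obj):
    """Skill learned on mugs, but gated on an affordance,
    so it transfers to anything graspable."""
    assert "graspable" in obj["affordances"], "no grasp affordance"
    return {**obj, "held": True}

def pour(obj):
    """Skill learned on pitchers; requires a held, pourable object."""
    assert obj.get("held") and "pourable" in obj["affordances"]
    return {**obj, "contents": "empty"}

mug = {"affordances": ["graspable", "pourable"], "contents": "coffee"}
result = pour(grasp(mug))   # a combination never seen during "training"
```

The composition succeeds not because mug-pouring appeared anywhere, but because the preconditions of pour are satisfied by the postconditions of grasp; that is what physical, as opposed to merely syntactic, composition buys you.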
Prediction becomes verifiable. When your prediction target is the next state of a physics aligned representation, you can check it against simulation or actual observation. When your target is the next word, "correctness" is just likelihood under a distribution. Structured tokens let you close the loop between prediction and reality in a way that subword prediction never can.
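Closing that loop can be illustrated with a toy simulator: a predicted quantity over a physics-aligned representation can be checked against a forward simulation, something a next-word likelihood never admits.

```python
def simulate_fall(height, dt=0.1, g=9.81):
    """Tiny ground-truth simulator: discretized time for a drop from rest."""
    t, y, v = 0.0, height, 0.0
    while y > 0:
        v += g * dt
        y -= v * dt
        t += dt
    return round(t, 2)

def verified(predicted_time, height, tol=0.2):
    """A structured prediction is right or wrong against the world,
    not merely likely or unlikely under a text distribution."""
    return abs(predicted_time - simulate_fall(height)) < tol
```

A model that predicts roughly 0.45 s for a one-meter drop passes the check; a model that predicts 2 s fails it, and that binary signal is trainable.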
What might a system built on these principles actually look like?
Roughly, picture a central memory of object and event centric tokens representing a scene graph. Add learned codebooks for each modality mapping raw sensor data to structured tokens, a planner operating over sequences of action effect tokens, a translation layer to map natural language instructions into constraints on the structured token space, and a narration layer to map token trajectories back into language for explanation. Language is still there because humans need it, but it is the interface, not the substrate.
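The division of labor above can be sketched as a skeleton. Every component name here is illustrative, standing in for a whole research program rather than any existing system:

```python
class WorldCore:
    """Hypothetical architecture: structured memory at the center,
    language only at the edges."""

    def __init__(self, encoders, planner, translator, narrator):
        self.memory = []              # object/event-centric scene tokens
        self.encoders = encoders      # per-modality codebooks: raw -> tokens
        self.planner = planner        # searches over action-effect tokens
        self.translator = translator  # language -> constraints on token space
        self.narrator = narrator      # token trajectory -> language

    def perceive(self, modality, raw):
        self.memory.extend(self.encoders[modality](raw))

    def act(self, instruction):
        constraints = self.translator(instruction)   # language in
        trajectory = self.planner(self.memory, constraints)
        return self.narrator(trajectory)             # language out
```

Note where language appears: only in `act`'s first and last lines. Everything between, which is to say all of the actual reasoning, happens over structured tokens.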
The bigger picture
I think this reframes what language should be in AI systems. It should not be the fundamental substrate of thought, but rather a high level interface for control, goal specification, and explanation. It is a human friendly communication layer on top of a more robust, world centric computational core.
Harnad's symbol grounding problem was a philosophical argument in 1990. 35 years and several trillion parameters later, it is an engineering constraint. We have scaled ungrounded symbol manipulation further than anyone expected, and the results are genuinely impressive for the kinds of tasks where language is a sufficient representation, such as summarization, code generation, and analysis of text. But for anything that requires understanding the physical world, we keep hitting the same wall, just at higher levels of sophistication.
We have been building the stack upside down, starting from language and trying to get to physics. It might be time to start from the world and let language sit on top where it belongs. The path there is not clear yet. But I think the first step is recognizing that the token, that humble unit of computation we have been taking for granted, is where the bottleneck actually lives.