Are words the best building blocks for AI?

Contemporary large language models operate on subword tokens, fragments of words derived from compression schemes like Byte Pair Encoding. For purely linguistic tasks, this representational choice has proven remarkably effective. Yet the moment we ask these systems to reason about the physical world, their performance degrades in ways that resist the usual fixes of more parameters or more data. I have come to believe that the root cause is more fundamental than most discussions acknowledge. The bottleneck is not the model architecture, nor the training corpus. It is the token itself.
Ask GPT-5.4 whether a bowling ball or a feather hits the ground first on the moon and it will nail it. That is textbook physics, written down a million times. But ask it to predict what happens when you slide a full coffee mug across a table toward a stack of books, and the answers get shaky. Which way does the coffee slosh? Does the mug stop or topple when it hits the stack? Does the stack scatter or absorb the impact? These are not exotic scenarios. A five-year-old who has knocked over a drink has better intuitions here than a frontier model, because the five-year-old's knowledge is grounded in physical experience, not in text that happens to mention mugs.
This is an old problem wearing new clothes. In 1990, Stevan Harnad called it the symbol grounding problem. A system that manipulates symbols without any connection to what those symbols refer to is not really understanding anything. It is doing sophisticated pattern matching over a formal system that floats free of the world. Think of the Chinese Room argument, but applied to the entire representational substrate. Harnad argued that symbols need to be grounded in sensorimotor interaction with the environment to carry genuine meaning. 35 years later, we have scaled the ungrounded symbol manipulation to a breathtaking degree, and the failure modes he predicted are exactly the ones we see.
Language is a great interface for human communication, but as a representation of reality it is lossy, indirect, and socially constructed. It does not natively encode physics, geometry, causality, or dynamics. LLMs get impressive world knowledge by learning statistical correlations across massive text corpora, but they are reconstructing a world model from what is essentially a flattened linguistic projection of the actual world. The model has never felt the weight of anything, never watched liquid pour, never experienced an object resist being pushed. It is working from descriptions of these things, and descriptions always leave out the parts that seemed too obvious to mention.
This is also the frame problem showing up in a modern context. You cannot enumerate all the real world factors that stay constant or change in a given situation using language alone. When you move a cup from a table, the table is still there, gravity still applies, the liquid inside still has momentum. Nobody writes these things down because they are obvious to any embodied agent. But a system reasoning purely from text has no mechanism to know what to hold constant and what to update. Biological intelligence does not try to enumerate invariants. It uses representations with strong inductive biases already aligned to how the physical world works. The invariants are baked into the format of the representation itself.
Why language makes a shaky foundation
Words are convenient and compositional, but they are a surprisingly bad substrate for general intelligence once you start poking at why. The problems stack up in layers.
The grounding gap. The meaning of chair is rooted in shared sensorimotor experience. You have sat in one, seen one, knocked one over. You know what a chair affords (sitting, standing on, wedging under a doorknob) because you have a body that has interacted with chairs. The token chair is a statistical shadow of all that experience. LLMs work with the shadow, not the thing. This is not a minor detail. It means the model's "understanding" of chair is a cloud of co-occurrence statistics, with chair appearing near sit, table, wooden, comfortable. That is enough to answer trivia. It is not enough to predict that a three-legged chair will tip if you lean left.
No native physics. Geometry, contact dynamics, forces, material properties, causal counterfactuals. None of it is natively encoded in language. The model has to reconstruct all of it from correlation. Sometimes correlation is a decent proxy. Often it is not, and the failures cluster around exactly the cases where physical reasoning matters. Novel object configurations, edge cases in manipulation, anything involving fluid dynamics or deformable objects. Ask a model to describe what happens when you inflate a balloon inside a cardboard box and you will see the cracks immediately.
Combinatorial explosion of the unstated. The true set of relevant invariants in any physical scenario is vastly larger than what language typically describes. A sentence like "she put the glass on the shelf" implies that the shelf is horizontal, rigid, and strong enough to support the glass. It implies that the glass is upright, that gravity is acting downward, that the glass is not already on the shelf. Nobody spells this out because it is obvious to an embodied reader. But a system that must infer the complete physical state from text alone faces a combinatorial explosion of implicit assumptions, and it only needs to get one wrong to produce nonsensical downstream reasoning.
What a better token would look like
If we want more capable, grounded AI systems, the fundamental unit of computation needs to evolve beyond word fragments. A richer token system would encode the structure of the world more directly.
What would that actually look like? I think it means encoding structured geometry (objects, surfaces, poses, depth), dynamics (contact, friction, forces), agency (actions, intentions, affordances), causality (interventions, counterfactuals), hierarchical time (event centered temporal structure at variable rates), and explicit uncertainty (beliefs, hypotheses, confidence). That is a lot to ask of a token, but the current alternative, hoping that statistical correlations over text will somehow reconstruct all of this, is clearly not working for anything that requires real physical reasoning.
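To make the list above concrete, here is a minimal sketch of what one such token might carry for a single object. Every field name here is hypothetical, chosen only to mirror the categories just listed, not drawn from any existing system.

```python
from dataclasses import dataclass, field

@dataclass
class WorldToken:
    """Hypothetical structured token for one object in a scene."""
    object_id: int
    pose: tuple                  # geometry: position + orientation quaternion
    velocity: tuple              # dynamics: linear velocity, m/s
    material: str                # proxy for friction, rigidity, density
    contacts: list = field(default_factory=list)     # ids of touching objects
    affordances: list = field(default_factory=list)  # agency: e.g. "graspable"
    confidence: float = 1.0      # explicit uncertainty about this estimate

mug = WorldToken(
    object_id=0,
    pose=(0.2, 0.0, 0.8, 0.0, 0.0, 0.0, 1.0),
    velocity=(0.0, 0.0, 0.0),
    material="ceramic",
    contacts=[1],                          # resting on object 1, the table
    affordances=["graspable", "pourable"],
)
```

Even this toy version makes the contrast with a subword token visible: the invariants a sentence leaves unstated (contact with the table, zero velocity) are explicit fields, not inferences.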
Some promising directions
Nobody has solved this, but a few research threads are chipping away at pieces of it in ways I find genuinely promising.
Object centric representations are probably the most mature. Work like Slot Attention (Locatello et al.) and SAVi learns to decompose scenes into discrete object slots, each carrying attributes like pose, shape, and velocity. This is a natural fit for the factored structure of physics. The world actually is made of distinct objects that interact, and if your representation mirrors that, prediction and planning get dramatically easier. The limitation is that these methods mostly work on simple synthetic scenes. Scaling them to cluttered, real world environments is an active problem.
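The core mechanic of Slot Attention is that slots compete for input features: attention is normalized across slots rather than across inputs, so each feature gets claimed by roughly one slot. A stripped-down sketch of that iteration, omitting the learned projections, GRU update, and layer norms of the real method:

```python
import numpy as np

def softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def slot_attention(inputs, num_slots=4, iters=3, seed=0):
    """Toy Slot Attention: slots iteratively compete to explain inputs."""
    rng = np.random.default_rng(seed)
    n, d = inputs.shape
    slots = rng.normal(size=(num_slots, d))
    for _ in range(iters):
        logits = inputs @ slots.T / np.sqrt(d)          # (n, num_slots)
        # Softmax over the SLOT axis: slots compete for each input feature.
        attn = softmax(logits, axis=1)
        # Each slot becomes the weighted mean of the inputs it won.
        weights = attn / (attn.sum(axis=0, keepdims=True) + 1e-8)
        slots = weights.T @ inputs                       # (num_slots, d)
    return slots
```

The competition over the slot axis is the inductive bias doing the work: it is what pushes the representation toward "a small set of distinct objects" rather than a smeared global code.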
Action effect tokens go a step further and tokenize interactions directly. Consider something like (action: pinch-grasp, result: stable lift, observed_slip: 0.02) where the token itself encodes the causal relationship between what you did and what happened. This is close to how robotics researchers think about skill primitives, and it makes the learned representation directly useful for planning. You can chain action effect tokens into a sequence and ask whether the resulting trajectory is physically plausible.
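A sketch of that chaining idea, with hypothetical state labels: a plan is plausible only if each token's precondition matches the state the previous token left behind.

```python
from dataclasses import dataclass

@dataclass
class ActionEffectToken:
    """Hypothetical token coupling an action to its observed outcome."""
    action: str
    precondition: str       # world state the action assumes
    effect: str             # world state the action produces
    observed_slip: float = 0.0

def physically_plausible(trajectory, state):
    """Chain tokens and check that each link's assumption holds."""
    for tok in trajectory:
        if tok.precondition != state:
            return False
        state = tok.effect
    return True

plan = [
    ActionEffectToken("pinch-grasp", "mug-on-table", "mug-held", observed_slip=0.02),
    ActionEffectToken("pour", "mug-held", "mug-empty"),
]
```

With this encoding, `physically_plausible(plan, "mug-on-table")` holds, while pouring before grasping fails immediately, which is exactly the kind of check a subword sequence gives you no handle on.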
Causal program tokens take a different angle entirely, representing mechanisms as small executable program fragments that can be composed. Dreamcoder-style systems (Ellis et al.) learn libraries of programmatic abstractions from examples. The appeal is that programs are inherently compositional and simulatable. You can run them forward, run them in reverse for inference, and combine them in ways that preserve their semantics. The challenge is that learning the right program primitives from raw sensory data remains extremely hard.
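A toy illustration of the composability claim, using made-up mechanism fragments rather than anything learned: each mechanism is a small executable function over a state dict, and composition is just running them in sequence.

```python
from functools import partial, reduce

def push(state, force=1.0):
    """Mechanism fragment: an impulse changes velocity by force / mass."""
    return {**state, "v": state["v"] + force / state["mass"]}

def friction(state, mu=0.3, dt=0.1):
    """Mechanism fragment: kinetic friction decays speed toward zero."""
    return {**state, "v": max(0.0, state["v"] - mu * 9.81 * dt)}

def compose(*mechanisms):
    """Programs compose: run the fragments forward in order."""
    return lambda state: reduce(lambda s, m: m(s), mechanisms, state)

# A "slide" program assembled from two independently defined mechanisms.
slide = compose(partial(push, force=5.0), friction)
```

Running `slide({"v": 0.0, "mass": 1.0})` executes the mechanism forward and yields a velocity you can inspect, simulate onward from, or compare against observation, which is the property that makes program tokens attractive.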
On the perception side, multimodal latent tokens (VQ-VAE style discrete codes shared across vision, audio, and proprioception) offer a way to create a common representational currency across modalities. And geometric tokens, implicit field descriptors like NeRF latents or scene graph nodes with differentiable hooks into rendering and simulation, let you represent 3D structure in a form that is both learnable and physically meaningful.
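The "common currency" idea reduces to a simple operation: snap each modality's continuous latent to its nearest entry in a shared codebook, so different senses can emit the same discrete token for the same underlying event. A minimal VQ-style sketch with a toy codebook:

```python
import numpy as np

def quantize(latents, codebook):
    """Map continuous latents to nearest codebook entries (VQ-VAE style)."""
    dists = ((latents[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    codes = dists.argmin(axis=1)          # discrete token ids
    return codes, codebook[codes]         # ids + quantized vectors

codebook = np.eye(4)                      # 4 codes in a 4-d latent space
vision_latent = np.array([[0.9, 0.1, 0.0, 0.0]])   # e.g. "glass shatters", seen
audio_latent = np.array([[0.8, 0.2, 0.1, 0.0]])    # the same event, heard
```

Both latents land on code 0: two modalities, one token, which is what makes downstream reasoning modality-agnostic.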
None of these are ready for prime time. Most are barely past proof of concept. But they all point in a direction I find compelling. The token should reflect the structure of the world, not just the structure of the language we use to talk about it.
Connecting language to the world
Even if we build richer tokens, language is not going away. Humans communicate in it, goals get specified in it, explanations need to come back in it. So the architectural question becomes how to bridge word tokens and these richer representations without either side dominating.
Cross-attention is the more established path. Systems like Flamingo and PaLI use it to let a language model query information from other modalities, including image patches, audio signals, and whatever else you plug in. The language stream can attend to specific regions of a visual representation, effectively pointing at the perceptual grounding it needs. This works well when language needs to reference perception ("the red block on the left"), but it still treats the language model as the backbone. Perception serves language, not the other way around.
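Stripped of learned projections and multiple heads, the querying mechanism is compact: each text token scores every image patch and pulls back a weighted summary of the visual features it matched.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def cross_attention(text_queries, image_patches):
    """Single-head sketch: text queries attend over image keys/values."""
    d = text_queries.shape[-1]
    scores = text_queries @ image_patches.T / np.sqrt(d)  # (n_text, n_patches)
    return softmax(scores) @ image_patches                # grounded summaries
```

The asymmetry the paragraph describes is visible in the signature: text supplies the queries and receives the output, while the image only ever serves as keys and values.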
Joint-embedding methods flip that relationship in an interesting way. LeCun's V-JEPA learns by predicting the representations of masked regions of video in a shared latent space, rather than reconstructing raw pixels. By skipping pixel level prediction, the model gets pushed toward learning exactly the kind of higher level structure we want, including object persistence, motion dynamics, and physical context. The representations that emerge are not captioned experience. They are a learned compression of visual reality that captures the invariances text leaves out. This is what gives you perceptual representations actually worth bridging to.
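The distinction from pixel reconstruction is easiest to see in the loss. A heavily simplified sketch, with stand-in latents: the prediction target and the error both live in latent space, so the model is never penalized for failing to render texture it does not need.

```python
import numpy as np

def jepa_loss(context_latent, target_latent, predictor):
    """Predict the masked region's latent from the visible context.
    No pixels anywhere: the objective is defined entirely in latent space."""
    return float(((predictor(context_latent) - target_latent) ** 2).mean())

rng = np.random.default_rng(0)
context = rng.normal(size=16)   # latent of the visible part of the video
target = rng.normal(size=16)    # latent of the masked-out region
```

With an identity predictor and a target equal to the context, the loss is exactly zero; with a genuinely different target it is not, and all of the gradient pressure goes into predicting structure rather than appearance.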
I think you need both. Cross-attention gives language a way to query the world. Joint-embedding methods give the world a representation worth querying.
What this changes about learning
Adopting structured tokens shifts the learning objective from predict the next subword to something more like predict the next multimodal latent state or predict the causal consequences of an action. That sounds like a small change in framing. In practice it reshapes the entire learning dynamic.
Credit assignment gets far sharper. When a robot drops an object, a system built on structured tokens can trace the failure to a specific parameter. Perhaps grip orientation was 15 degrees off, or contact force was insufficient. A language token system can only propagate error back through a vaguely correlated word somewhere in a long prompt. The structured representation makes the error surface navigable instead of opaque.
You get real compositionality. This is the big one. If tokens align with objects and mechanisms, they can be recombined in physically lawful ways. You can take a "grasp" token learned on mugs and compose it with a "pour" token learned on pitchers and get a reasonable plan for pouring from a mug, even if that exact combination never appeared in training. Language tokens compose syntactically, which is powerful, but structured world tokens compose physically, which is what you need for generalization to novel tasks.
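The mug-and-pitcher example can be sketched directly, with entirely hypothetical skill functions: because each skill is conditioned on affordances rather than on the objects it was learned from, the novel combination composes lawfully.

```python
def grasp(obj):
    """Skill learned on mugs, but gated on an affordance,
    so it transfers to anything graspable."""
    assert "graspable" in obj["affordances"], "no grasp affordance"
    return {**obj, "held": True}

def pour(obj):
    """Skill learned on pitchers; requires a held, pourable object."""
    assert obj.get("held") and "pourable" in obj["affordances"]
    return {**obj, "contents": "empty"}

mug = {"affordances": ["graspable", "pourable"], "contents": "coffee"}
result = pour(grasp(mug))   # a combination never seen during "training"
```

The composition succeeds not because mug-pouring appeared anywhere, but because the preconditions of pour are satisfied by the postconditions of grasp; that is what physical, as opposed to merely syntactic, composition buys you.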
Prediction becomes verifiable. When your prediction target is the next state of a physics aligned representation, you can check it against simulation or actual observation. When your target is the next word, "correctness" is just likelihood under a distribution. Structured tokens let you close the loop between prediction and reality in a way that subword prediction never can.
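Closing that loop can be illustrated with a toy simulator: a predicted quantity over a physics-aligned representation can be checked against a forward simulation, something a next-word likelihood never admits.

```python
def simulate_fall(height, dt=0.1, g=9.81):
    """Tiny ground-truth simulator: discretized time for a drop from rest."""
    t, y, v = 0.0, height, 0.0
    while y > 0:
        v += g * dt
        y -= v * dt
        t += dt
    return round(t, 2)

def verified(predicted_time, height, tol=0.2):
    """A structured prediction is right or wrong against the world,
    not merely likely or unlikely under a text distribution."""
    return abs(predicted_time - simulate_fall(height)) < tol
```

A model that predicts roughly 0.45 s for a one-meter drop passes the check; a model that predicts 2 s fails it, and that binary signal is trainable.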
What might a system built on these principles actually look like?
Roughly, picture a central memory of object and event centric tokens representing a scene graph. Add learned codebooks for each modality mapping raw sensor data to structured tokens, a planner operating over sequences of action effect tokens, a translation layer to map natural language instructions into constraints on the structured token space, and a narration layer to map token trajectories back into language for explanation. Language is still there because humans need it, but it is the interface, not the substrate.
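The division of labor above can be sketched as a skeleton. Every component name here is illustrative, standing in for a whole research program rather than any existing system:

```python
class WorldCore:
    """Hypothetical architecture: structured memory at the center,
    language only at the edges."""

    def __init__(self, encoders, planner, translator, narrator):
        self.memory = []              # object/event-centric scene tokens
        self.encoders = encoders      # per-modality codebooks: raw -> tokens
        self.planner = planner        # searches over action-effect tokens
        self.translator = translator  # language -> constraints on token space
        self.narrator = narrator      # token trajectory -> language

    def perceive(self, modality, raw):
        self.memory.extend(self.encoders[modality](raw))

    def act(self, instruction):
        constraints = self.translator(instruction)   # language in
        trajectory = self.planner(self.memory, constraints)
        return self.narrator(trajectory)             # language out
```

Note where language appears: only in `act`'s first and last lines. Everything between, which is to say all of the actual reasoning, happens over structured tokens.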
The bigger picture
I think this reframes what language should be in AI systems. It should not be the fundamental substrate of thought, but rather a high level interface for control, goal specification, and explanation. It is a human friendly communication layer on top of a more robust, world centric computational core.
Harnad's symbol grounding problem was a philosophical argument in 1990. 35 years and several trillion parameters later, it is an engineering constraint. We have scaled ungrounded symbol manipulation further than anyone expected, and the results are genuinely impressive for the kinds of tasks where language is a sufficient representation, such as summarization, code generation, and analysis of text. But for anything that requires understanding the physical world, we keep hitting the same wall, just at higher levels of sophistication.
We have been building the stack upside down, starting from language and trying to get to physics. It might be time to start from the world and let language sit on top where it belongs. The path there is not clear yet. But I think the first step is recognizing that the token, that humble unit of computation we have been taking for granted, is where the bottleneck actually lives.