Ian Bigford

Are words the best building blocks for AI?

7/13/2025 · 6 min read

Here's something that's been bugging me about the way we build language models: they operate on subword tokens—fragments of words produced by compression schemes like byte-pair encoding (BPE). This works incredibly well for language tasks. Prediction, paraphrasing, style transfer—all great. But the moment you want an AI that can reason about the physical world, you hit a wall. And I think the wall is the token itself.
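To make the starting point concrete, here's a toy sketch of the greedy-merge idea behind BPE-style tokenizers. The merge table is invented for illustration—it's not any real model's vocabulary—but the mechanism is the same: the model's atomic unit is whatever fragments the merges happen to produce.

```python
def bpe_segment(word, merges):
    """Greedily apply merge rules (in priority order) to a character list."""
    symbols = list(word)
    for a, b in merges:
        i, merged = 0, []
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == a and symbols[i + 1] == b:
                merged.append(a + b)   # fuse the pair into one subword symbol
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols

# Hypothetical merge table, as if learned from corpus statistics.
merges = [("i", "n"), ("t", "h"), ("th", "e"), ("in", "g"), ("e", "r")]
print(bpe_segment("thinker", merges))  # → ['th', 'in', 'k', 'er']
```

Note what the output is: statistically convenient fragments, with no hook to geometry, physics, or causality. That's the substrate everything else is built on.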

The core issue is a representational mismatch.

[Figure: The Representational Gap — LLMs train on the text shadow of reality, like Plato's cave, missing the physical world that casts it]

Language is a fantastic interface for human communication, but it's a lossy, indirect, socially-constructed representation of reality. It doesn't natively encode physics, geometry, causality, or dynamics. LLMs get impressive world knowledge by learning statistical correlations across massive text corpora, but they're fundamentally reconstructing a world model from a flattened, linguistic projection. When a task depends on real-world invariants that aren't explicitly captured in language, things break—often in ways that look like the model "almost gets it" but doesn't quite.

This is essentially the frame problem: you can't enumerate all the real-world factors that stay constant or change in a given situation using language alone. It's computationally intractable. Biological intelligence doesn't even try. Instead, it uses representations with strong inductive biases that are already aligned with how the physical world works.

Why language makes a shaky foundation

Words are convenient and compositional, but they have real weaknesses as a substrate for general intelligence:

  • They're not grounded. The meaning of "chair" is rooted in shared sensorimotor experience—sitting in one, seeing one, moving one. The token chair is just a statistical shadow of that experience. LLMs work with the shadow, not the thing.
  • They don't carry physical structure. Words don't inherently encode geometry, contact dynamics, forces, or causal counterfactuals. The model has to infer these properties from correlation, which is indirect and often unreliable.
  • They create combinatorial nightmares. Trying to infer the complete state of a situation from text alone leads to combinatorial explosion. The true set of relevant invariants is far larger than what language can describe or imply.
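A back-of-envelope count makes the third point vivid. Even with crude, made-up numbers—a modest tabletop scene, a handful of binary predicates per object, pairwise relations—the joint state space a linguistic description would need to pin down is astronomical:

```python
# Back-of-envelope state count for a scene described only by discrete
# linguistic predicates. All numbers are illustrative assumptions.
objects = 10          # objects in a modest tabletop scene
predicates = 8        # binary properties per object (wet, fragile, ...)
relations = objects * (objects - 1) // 2   # pairwise relations (on, touching, ...)

states = 2 ** (objects * predicates + relations)
print(f"{states:.2e} joint configurations")  # astronomically many
```

No text corpus enumerates these configurations; language gestures at a few salient ones and leaves the rest implicit.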

What a better token would look like

If we want more capable, grounded AI systems, the fundamental unit of computation needs to evolve beyond word fragments. A richer token system should encode the structure of the world directly.

[Figure: Rich Token Anatomy — contrasting a flat word token (just an ID) with a structured world token containing layers of geometry, physics, affordance, and semantics]

I think that means:

  • Structured geometry — objects, surfaces, poses, depth, occlusion.
  • Dynamics — contact, friction, forces, physical stability.
  • Agency — actions, intentions, plans, affordances.
  • Causality — interventions, counterfactuals, latent mechanisms.
  • Hierarchical time — event-centered temporal structure that can operate at variable rates.
  • Uncertainty — explicit beliefs, hypotheses, confidence intervals.

Some promising directions

There's a lot of interesting research exploring alternatives to word-based tokens:

  • Object-centric tokens represent scenes as sets of discrete objects, each carrying attributes like pose, shape, and velocity. This aligns nicely with the factored nature of physics.
  • Action-effect tokens tokenize interactions themselves—something like (action: pinch-grasp, result: stable lift, observed_slip: s). The token is the causal relationship.
  • Causal program tokens represent mechanisms as small, executable, compositional program fragments that can be recombined.
  • Multimodal latent tokens use shared discrete codes (from something like a VQ-VAE) across vision, audio, and proprioception, linked by common object or event identifiers.
  • Geometric tokens employ implicit field descriptors (think NeRF latents) or scene graphs with differentiable hooks for rendering and simulation.
  • Belief tokens are explicit, updatable hypothesis units that are created, revised, and retired as evidence arrives, each carrying its own uncertainty measure.
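The multimodal latent direction is the easiest to sketch. At its core, VQ-style tokenization is just nearest-neighbor lookup into a shared codebook: continuous encoder outputs from any modality snap to the closest discrete code. This toy NumPy version uses a random codebook purely for illustration.

```python
import numpy as np

def quantize(latents, codebook):
    """Map continuous encoder outputs to discrete token ids (VQ-style).

    latents:  (n, d) continuous vectors from any modality's encoder
    codebook: (k, d) shared learned code vectors
    Returns token ids and their quantized embeddings.
    """
    # Squared distance from every latent to every code vector.
    dists = ((latents[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    ids = dists.argmin(axis=1)
    return ids, codebook[ids]

rng = np.random.default_rng(0)
codebook = rng.normal(size=(512, 16))      # shared across vision/audio/touch
vision_latents = rng.normal(size=(4, 16))  # stand-in encoder outputs
ids, quantized = quantize(vision_latents, codebook)
print(ids)  # four discrete token ids into the shared codebook
```

Because the codebook is shared, a visual "cup clink" and an auditory one can land on related codes, tied together by a common event identifier.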

None of these are mature yet. But they all point in the same direction: the token should reflect the world, not just the language we use to talk about it.

Connecting language to the world

The big architectural question is how to bridge word tokens and these richer representations. Two approaches seem particularly promising, and they're complementary.

Cross-attention is the more established path. In multimodal systems, cross-attention lets a language model query and integrate information from other modalities—image patches, audio signals, whatever. It creates a bridge where linguistic concepts can "point to" relevant perceptual data. This matters for building systems where language can condition or describe non-linguistic states.

Joint-embedding methods like V-JEPA take a different angle. Instead of predicting raw pixels, V-JEPA learns by predicting the representations of masked-out regions of video in a shared latent space. By skipping the expensive pixel reconstruction step, the model is pushed toward learning higher-level semantic features—object interaction, motion, physical context. The learning signal points directly at the kind of structured, causal invariances that word tokens lack.
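The objective can be caricatured in a few lines: encode the visible context, predict the masked region's representation, and compare in latent space—no pixel loss anywhere. This is a deliberately crude NumPy sketch of the idea, not the V-JEPA architecture; the tanh "encoders" and the target network stand in for much larger learned components.

```python
import numpy as np

rng = np.random.default_rng(2)

def encoder(x, W):       # stand-in for a learned video encoder
    return np.tanh(x @ W)

def predictor(z, P):     # predicts masked-region latents from context
    return z @ P

d_in, d_lat = 32, 16
W_ctx = rng.normal(size=(d_in, d_lat)) * 0.1
W_tgt = rng.normal(size=(d_in, d_lat)) * 0.1   # in practice a slow-moving copy
P = rng.normal(size=(d_lat, d_lat)) * 0.1

clip = rng.normal(size=(10, d_in))       # 10 patches of a video clip
visible, masked = clip[:6], clip[6:]     # mask out the last 4 patches

z_ctx = encoder(visible, W_ctx).mean(0)  # summarize the visible context
z_pred = predictor(z_ctx, P)             # predicted latent of the masked region
z_tgt = encoder(masked, W_tgt).mean(0)   # actual latent (held fixed in practice)

loss = ((z_pred - z_tgt) ** 2).mean()    # distance in latent space, not pixels
print(float(loss) >= 0.0)
```

Because the loss lives entirely in latent space, the encoder is free to discard pixel-level detail and keep whatever predicts how the scene unfolds.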

Cross-attention gives you the interface between language and perception. Joint-embedding methods give you the perceptual representations worth interfacing with. You need both.

What this changes about learning

Adopting structured tokens shifts the learning objective from "predict the next subword" to something like "predict the next multimodal latent state" or "predict the causal consequences of an action." Two immediate benefits:

  1. Better credit assignment. When something goes wrong, you can point to a specific parameter in a structured token—say, grip orientation was off—rather than tracing failure back to a vaguely correlated word in a long prompt.
  2. Real compositionality. If tokens align with objects and mechanisms, they can be recombined in physically lawful ways. This is how you get generalization to new tasks and environments without just hoping the training distribution covered it.

What might a system built on these principles look like?

[Figure: The Inverted Stack — language becomes a thin interface wrapper while the core reasoning happens in a grounded 3D scene graph and physics engine, fed by perception encoders]

Something like:

  • A central memory of object- and event-centric tokens representing a scene graph
  • Learned codebooks for each modality to map raw sensor data to structured tokens
  • A planner that operates over sequences of action-effect tokens
  • A translation layer to map natural language instructions into constraints on the structured token space
  • A narration layer to map token trajectories back into language for explanation
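Here's a minimal wiring of those five components. Every function body is a hypothetical placeholder—the real versions would be learned models and a physics engine—but the data flow is the point: language enters once (translation), the core loop runs over structured tokens, and language exits once (narration).

```python
class SceneMemory:                      # object/event-centric token store
    def __init__(self):
        self.tokens = {}
    def update(self, obj_id, token):
        self.tokens[obj_id] = token

def perceive(sensor_frame):             # modality codebooks -> structured tokens
    return {obj_id: {"pose": pose} for obj_id, pose in sensor_frame.items()}

def translate(instruction):             # language -> constraints on token space
    return {"goal": instruction.lower()}

def plan(memory, constraints):          # search over action-effect tokens
    return [("pinch-grasp", obj_id) for obj_id in memory.tokens][:1]

def narrate(plan_steps):                # token trajectory -> language
    return "; ".join(f"{action} the {obj}" for action, obj in plan_steps)

memory = SceneMemory()
for obj_id, token in perceive({"mug": (0.1, 0.0, 0.8)}).items():
    memory.update(obj_id, token)
steps = plan(memory, translate("Pick up the mug"))
print(narrate(steps))  # → pinch-grasp the mug
```

Notice where the words are: at the edges. The planner never touches them; it operates entirely on the structured token store.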

The bigger picture

This reframes what language should be in AI systems. Instead of the fundamental substrate of thought, language becomes a high-level interface—for control, goal specification, and explanation. A human-friendly communication layer built on top of a more robust, world-centric computational core.

The choice of token is a choice of ontology. If we want AI that's robustly grounded in reality, its core representations need to reflect the structure of the world, not just the structure of the words we use to describe it.