I'd say with confidence: we're living in the early days. AI has made jaw-dropping progress in two major domains: language and vision. With large language models (LLMs) like GPT-4 and Claude, and vision models like CLIP and DALL·E, we've seen machines that can generate poetry, write code, describe photos, and even hold eerily humanlike conversations.
But as impressive as this is, it’s easy to lose sight of the bigger picture: we’ve only scratched the surface of what artificial intelligence could be — because we’ve only scaled two modalities: text and images.
That’s like saying we’ve modeled human intelligence by mastering reading and eyesight, while ignoring touch, taste, smell, motion, memory, emotion, and everything else that makes our cognition rich, embodied, and contextual.
Human intelligence is multimodal. We make sense of the world through:
Touch (the texture of a surface, the feedback of pressure, the warmth of skin0; Smell and taste (deeply tied to memory, danger, pleasure, and even creativity); Proprioception (the sense of where your body is in space — how you move and balance); Emotional and internal states (hunger, pain, comfort, fear, motivation).
None of these are captured by current LLMs or vision transformers. Not even close. And yet, our cognitive lives depend on them.
Language and vision are just the beginning — the parts we were able to digitize first - not necessarily the most central to intelligence.
The real frontier of AI lies in the messy, rich, sensory world where people live. We’ll need new hardware (sensors), new data representations (beyond tokens), and new ways to train models that grow understanding from experience, not just patterns.