In the realm of human cognition, intelligence is often divided into two broad categories: crystallized and fluid. Crystallized intelligence refers to the accumulated knowledge, facts, and skills gained through experience and education—think vocabulary, historical facts, or procedural expertise. Fluid intelligence, on the other hand, is the ability to reason abstractly, solve novel problems, and adapt to new situations without relying on prior knowledge. It’s the mental agility that allows us to spot patterns in unfamiliar puzzles or innovate under uncertainty.
As large language models (LLMs) continue to evolve, these concepts provide a useful lens for understanding their capabilities and limitations. LLMs are powerhouses of crystallized intelligence, drawing from vast training datasets to regurgitate and remix information. However, they often falter in fluid intelligence tasks, where true generalization and creative reasoning are required. This post explores the dichotomy between crystallized and fluid intelligence in LLMs, how reinforcement learning (RL) intersects with them, and why benchmarks like ARC-AGI are crucial for measuring progress toward more human-like AI.
Crystallized Intelligence: The Backbone of LLMs
Crystallized intelligence in LLMs manifests as their impressive command over language, facts, and patterns derived from massive pre-training corpora. These models “learn” by absorbing trillions of tokens from books, websites, and code, building a repository of knowledge that enables tasks like answering trivia, generating essays, or even coding snippets. For instance, when asked about historical events or scientific concepts, LLMs can draw directly from memorized patterns, much like a seasoned expert recalling facts from years of study.
This strength is evident in benchmarks focused on domain-specific knowledge, where LLMs excel by leveraging what psychologists term “crystallized” abilities—applying learned information to familiar contexts. However, this reliance on pre-existing data means LLMs are, at heart, pattern-matchers. They perform well on tasks that align with their training distribution but struggle when faced with out-of-distribution problems that demand novel reasoning.
Fluid Intelligence: Where LLMs Fall Short—and Where Progress is Being Made
Fluid intelligence requires adaptability, pattern recognition in unseen scenarios, and the ability to form analogies on the fly. In LLMs, this translates to challenges like solving abstract puzzles or generalizing rules from minimal examples. Traditional LLMs lean heavily on crystallized knowledge, often failing to exhibit true fluid capabilities because they can’t easily “think” beyond their training data.
A prime example is the Abstraction and Reasoning Corpus (ARC-AGI), a benchmark designed by François Chollet to test fluid intelligence in AI systems. ARC tasks involve grid-based puzzles where models must infer rules from a few demonstrations and apply them to new grids—mimicking human-like abstraction without relying on memorized facts. Early LLMs scored poorly here, highlighting a gap in fluid reasoning.
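To make the format concrete, here is a minimal sketch (in Python) of what an ARC-style task looks like. The grids, the hand-written candidate rule, and the tiny verification loop are illustrative only; real ARC tasks use grids of up to 30×30 cells with ten colors, and the whole point is that the rule is never given to the model.

```python
# A toy ARC-style task: each grid is a list of rows of color indices (0-9).
# Hidden rule in this made-up example: recolor every non-zero cell to color 2.
train_pairs = [
    ([[0, 1], [1, 0]], [[0, 2], [2, 0]]),
    ([[3, 0], [0, 3]], [[2, 0], [0, 2]]),
]
test_input = [[0, 5], [5, 5]]

def candidate_rule(grid):
    """Hypothesis: replace every non-zero cell with color 2."""
    return [[2 if cell else 0 for cell in row] for row in grid]

# A solver must first check its hypothesis against the demonstrations...
assert all(candidate_rule(x) == y for x, y in train_pairs)

# ...and only then apply it to the held-out test grid.
print(candidate_rule(test_input))  # [[0, 2], [2, 2]]
```

The difficulty is not applying the rule but inferring it from two or three demonstrations, which is precisely the fluid, few-shot abstraction that crystallized knowledge alone cannot supply.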
But recent advancements are closing this gap. Yesterday, we witnessed the release of xAI’s Grok 4, which made headlines by achieving a state-of-the-art 15.9% accuracy on ARC-AGI-2, the harder successor to the original benchmark. This score nearly doubles the previous best from models like Claude Opus 4 (at 8.6%) and outperforms others like o3 (around 3%). On the original ARC-AGI (v1), Grok 4 hits an impressive 66.7%, showcasing its enhanced ability to handle abstract reasoning. This isn’t just incremental; it’s a leap that suggests Grok 4 is “crushing” ARC-2 relative to its peers, blending crystallized knowledge with improved fluid mechanisms like chain-of-thought prompting and simulated reasoning.
Looking ahead, one can’t help but wonder what ARC-3 might bring. If ARC-2 ramped up complexity with more intricate rules and contextual adaptations, ARC-3 could introduce even greater demands—perhaps multi-step analogies, noisy inputs, or real-time learning loops—to push AI toward human-level fluidity, where people solve nearly 100% of tasks.
The Intersection with Reinforcement Learning (RL)
Reinforcement learning bridges crystallized and fluid intelligence in LLMs by introducing goal-oriented adaptation. RL trains models through trial-and-error, rewarding actions that lead to successful outcomes. When integrated with LLMs (e.g., in RLHF—Reinforcement Learning from Human Feedback), it refines crystallized outputs for better alignment but also boosts fluid aspects by encouraging exploration of novel strategies.
For fluid tasks, RL can simulate “thinking” processes: an LLM might generate multiple hypotheses, evaluate them via RL policies, and iterate—like a human puzzling through options. In Grok 4’s case, its strong ARC performance likely benefits from RL-infused training, enabling it to compress and generalize patterns more effectively. This combination could evolve LLMs from static knowledge repositories to dynamic problem-solvers, where crystallized data serves as a foundation for fluid exploration.
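As a rough illustration of that generate-evaluate-iterate loop, here is a schematic sketch in Python. It is not Grok 4’s training recipe or any particular RLHF implementation; generate_hypotheses and reward are placeholders for an LLM sampler and a task-specific reward signal.

```python
import random

def generate_hypotheses(task, n=8):
    """Placeholder for an LLM sampling n candidate solutions for a task."""
    return [f"candidate program {i}" for i in range(n)]

def reward(task, hypothesis):
    """Placeholder reward signal, e.g. how many demonstration pairs the
    hypothesis reproduces; random here purely to keep the sketch runnable."""
    return random.random()

def solve(task, rounds=3):
    """Generate-evaluate-iterate: keep the best-scoring hypothesis seen so far."""
    best, best_score = None, float("-inf")
    for _ in range(rounds):
        for hyp in generate_hypotheses(task):
            score = reward(task, hyp)
            if score > best_score:
                best, best_score = hyp, score
        # In a real RL setup the scores would also update the policy that
        # generates hypotheses, so later rounds explore more promising regions.
    return best

print(solve(task="toy ARC task"))
```

Best-of-N sampling with a reward signal is only the simplest version of the idea; the point is that the reward, rather than memorized text, is what steers the search.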
Insights from Douglas Hofstadter’s “Fluid Concepts and Creative Analogies”
Douglas Hofstadter’s seminal book, Fluid Concepts and Creative Analogies: Computer Models of the Fundamental Mechanisms of Thought (1995), offers profound observations that resonate with today’s LLM challenges. Hofstadter argues that fluid intelligence isn’t about rote computation but analogy-making—the ability to map concepts from one domain to another creatively. His models, like Copycat, simulate how humans build “fluid” perceptions by blending high-level concepts with low-level details, adapting to ambiguity through probabilistic reasoning.
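Copycat operates in a deliberately tiny domain: letter-string analogies such as “if abc changes to abd, what does ijk change to?” The sketch below hard-codes one such mapping just to show the shape of the problem; Copycat itself discovers mappings through an emergent, probabilistic process rather than applying a fixed rule.

```python
def apply_successor_analogy(source, target, probe):
    """Toy analogy: find the position where source -> target replaced a letter
    with its alphabetic successor, then make the same change to the probe."""
    for i, (a, b) in enumerate(zip(source, target)):
        if b == chr(ord(a) + 1):  # this letter was bumped to its successor
            return probe[:i] + chr(ord(probe[i]) + 1) + probe[i + 1:]
    return probe  # no successor change found; return the probe unchanged

# "abc changes to abd; what does ijk change to?"  ->  "ijl"
print(apply_successor_analogy("abc", "abd", "ijk"))
```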
In LLMs, this ties directly to fluid shortcomings: most models excel at surface-level matching (crystallized) but falter in deep analogies, often producing brittle responses to variations.
For Hofstadter, analogy is the “fuel and fire of thinking,” and RL can enhance this by rewarding analogical leaps, much like Grok 4’s reasoning mode on ARC tasks. As AI progresses, incorporating Hofstadter’s ideas—such as emergent, self-organizing analogies—could make fluid intelligence more robust, turning LLMs into truly creative thinkers.
Tying It All Together: Kolmogorov Complexity as a Unifying Lens
To make sense of this interplay, consider Kolmogorov complexity, a concept from information theory popularized in AI discussions by Ilya Sutskever (co-founder of OpenAI). The Kolmogorov complexity of a piece of data is the length of the shortest program that can describe or generate it—the essence of compression as understanding. Sutskever has noted in presentations that true intelligence involves finding minimal descriptions for complex phenomena, linking compression to prediction and generalization.
In LLMs, crystallized intelligence reflects pre-compressed knowledge from training data—efficient for familiar tasks but high-complexity (long “programs”) for novelties. Fluid intelligence, conversely, demands on-the-fly compression: inferring short rules from sparse examples, as in ARC puzzles. RL aids by optimizing these compression policies, reducing complexity through rewarded efficiency.
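Kolmogorov complexity is uncomputable in general, but compressed length is a standard practical proxy. The toy comparison below (using Python’s zlib, purely for illustration) shows the intuition: a patterned sequence admits a much shorter description than a random-looking one of the same length.

```python
import os
import zlib

def compressed_size(data: bytes) -> int:
    """Crude proxy for Kolmogorov complexity: zlib-compressed length in bytes."""
    return len(zlib.compress(data, 9))

patterned = b"0123456789" * 100   # 1000 bytes generated by a very short rule
random_ish = os.urandom(1000)     # 1000 bytes with (almost surely) no short rule

print(compressed_size(patterned))   # small: the regularity compresses away
print(compressed_size(random_ish))  # close to 1000: nothing to compress
```

In these terms, crystallized intelligence ships with the compression already done, while fluid intelligence has to build a short description on the fly from a handful of examples.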
Grok 4’s ARC-2 dominance exemplifies this: it compresses abstract rules more effectively than its rivals, in effect finding shorter “programs” for each solution. Hofstadter’s analogies are a form of compression, mapping disparate ideas into compact structures. As we speculate on ARC-3, it might test extreme compression under uncertainty, demanding AI that embodies Kolmogorov’s ideal: maximal understanding from minimal code.
Ultimately, the fusion of crystallized (stored compression) and fluid (adaptive compression) intelligence via RL is propelling LLMs forward. Benchmarks like ARC-AGI aren’t just tests—they’re catalysts for breakthroughs, as seen with Grok 4. While we’re not at AGI yet, these elements suggest a path: AI that learns, draws analogies, and compresses like us, unlocking intelligence that’s both deep and agile. And until that lands, perhaps through a unified embedding scheme (a different discussion for a different post), we won’t really be able to make definitive statements about AGI.