Large Language Models (LLMs) and their multimodal extensions have revolutionized human-AI interaction, yet they frequently falter on deceptively simple tasks. These failures stem not from a lack of intelligence but from architectural constraints and training data idiosyncrasies. Below, we examine two emblematic cases: numerical comparisons like 9.9 versus 9.11, and generating images of clocks showing specific times such as 2:00 PM.
Case 1: The Decimal Dilemma
Determining whether 9.9 or 9.11 is larger should be straightforward: 9.9 (or 9.90) clearly exceeds 9.11. However, many LLMs erroneously claim the opposite. The error arises primarily from tokenization, which fragments inputs into tokens like ["9", ".", "11"] for 9.11 and ["9", ".", "9"] for 9.9. Lacking native arithmetic, models rely on probabilistic patterns from training data, where "9.11" often appears in contexts implying superiority, such as software versions (version 9.11 follows 9.9) or dates (9/11). Segment-wise comparison compounds this: treated as version segments, 11 numerically outranks 9, overriding decimal place value. Because LLMs simulate reasoning via next-token prediction, biases from non-mathematical corpora, dominated by version numbers and historical references, skew outputs. Techniques like chain-of-thought prompting help, but the core limitation persists in untuned models.
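The clash between decimal and version-number semantics is easy to reproduce directly. A minimal sketch in Python (the comparison logic here is illustrative of the competing conventions, not a model of any particular tokenizer):

```python
# Compare "9.9" and "9.11" under three different semantics.
a, b = "9.9", "9.11"

# 1. Decimal semantics: 9.9 == 9.90, which exceeds 9.11.
print(float(a) > float(b))  # True

# 2. Version-number semantics: split on ".", compare segments as integers.
#    The "11" segment outranks "9", so 9.11 ranks higher, as in software versions.
ver = lambda s: tuple(int(p) for p in s.split("."))
print(ver(b) > ver(a))  # True: (9, 11) > (9, 9)

# 3. Plain string comparison: character by character, "1" < "9",
#    so "9.11" actually sorts before "9.9" lexicographically.
print(b < a)  # True
```

The same pair of strings yields three defensible orderings depending on the convention applied, which is precisely the ambiguity a pattern-matching model must resolve from context.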
Case 2: The Clock Conundrum
Similarly, asking an LLM to generate "a picture of a clock that reads 2:00 PM" often yields an image stuck at 10:10 or with inaccurately placed hands. This is not a misunderstanding of "PM" (which does not alter an analog clock's face) but a deep-seated bias in image generation models such as DALL-E and Stable Diffusion. Trained on vast datasets of captioned images, these systems inherit conventions from commercial photography: watch advertisements overwhelmingly set analog clocks to 10:10 for aesthetic appeal, since it frames the brand logo symmetrically and evokes a "smiling" face via pareidolia. Diffusion-based generators have no explicit model of clock mechanics; they pattern-match rather than compute hand positions (e.g., hour hand at 60 degrees for 2:00, minute hand at 0). When a prompt specifies another time, the model defaults to the overrepresented 10:10 archetype, ignoring the instruction because precise, varied clock depictions are scarce in the data. This highlights a broader issue: generative AI excels at interpolation but struggles with extrapolation beyond dominant training patterns.
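The hand geometry that diffusion models fail to compute is trivial arithmetic. A minimal sketch (the function name hand_angles is ours, not part of any generation library):

```python
def hand_angles(hour: int, minute: int) -> tuple[float, int]:
    """Return (hour_hand, minute_hand) angles in degrees,
    measured clockwise from the 12 o'clock position."""
    hour_angle = (hour % 12) * 30 + minute * 0.5  # 360 deg / 12 hours = 30 deg per hour
    minute_angle = minute * 6                     # 360 deg / 60 minutes = 6 deg per minute
    return hour_angle, minute_angle

print(hand_angles(2, 0))    # (60.0, 0): the requested 2:00
print(hand_angles(10, 10))  # (305.0, 60): the advertising default
```

A symbolic renderer could place the hands exactly from these two numbers; a diffusion model instead samples from whatever hand configurations dominated its training set.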
Both pitfalls underscore AI's reliance on correlative learning over causal understanding. As models evolve, through refined datasets, hybrid symbolic integration, or specialized fine-tuning, such quirks may diminish. Until then, users must employ targeted prompting or external tools to circumvent them, reminding us that AI, for all its prowess, remains a reflection of its human-curated origins.