Beyond the Hype: What Large Language Models Still Struggle With 🤯

We’re constantly bombarded with dazzling charts showing the relentless upward march of AI capabilities. Every new model release feels like a giant leap towards Artificial General Intelligence (AGI), leaving us in a state of awe and perhaps a little anxiety. But are we being too optimistic? Peter Gostev, in a recent talk, dives into the less glamorous side of LLMs, exploring what they don’t do well and why that matters.

The Illusion of Constant Progress: A Deeper Look 📈

Peter highlights a common perception: the benchmark charts always show a steady climb. At Arena, where they track model performance, they’ve seen over 700 text models, and the trend is undeniably upward. However, Peter argues this isn’t the whole story. He introduces two methods for uncovering the models’ hidden limitations: his own “nonsense” benchmark and a deeper dive into Arena’s extensive user-voted data.

The Nonsense Benchmark: When Models Go Along for the Ride 🤪

Peter’s personal benchmark is brilliantly simple: What happens when you ask LLMs nonsensical questions? Do they politely point out the absurdity, or do they gamely try to answer? Using about 155 questions designed to be illogical, he found the results eye-opening.
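
Peter’s exact setup isn’t shared in the talk, but a benchmark like this could plausibly be run by sending each illogical question to the model under test and having a second model grade whether the reply pushes back on the flawed premise. The sketch below is purely illustrative: `NONSENSE_QUESTIONS`, `GRADER_PROMPT`, the `call_model` helper, and the grader model name are all hypothetical stand-ins, not Peter’s actual material.

```python
# Hypothetical sketch of a "nonsense" benchmark: ask deliberately illogical
# questions and grade whether the model pushes back on the flawed premise.
# `call_model` stands in for whatever chat-completion client is available;
# the questions and grading rubric are illustrative, not Peter's actual set.

NONSENSE_QUESTIONS = [
    "How many corners does a circle have in February?",
    "What is the average weight of a Tuesday?",
]

GRADER_PROMPT = (
    "Question: {question}\n"
    "Model answer: {answer}\n\n"
    "Did the answer point out that the question is nonsensical? "
    "Reply with exactly one word: GREEN (clearly pushed back), "
    "AMBER (hedged but still answered), or RED (answered as if sensible)."
)


def call_model(model: str, prompt: str) -> str:
    """Placeholder for a real chat-completion call (e.g. an OpenAI-compatible API)."""
    raise NotImplementedError


def pushback_rate(model: str, grader: str = "some-strong-grader-model") -> float:
    """Fraction of nonsense questions where the model's reply was graded GREEN."""
    green = 0
    for question in NONSENSE_QUESTIONS:
        answer = call_model(model, question)
        verdict = call_model(grader, GRADER_PROMPT.format(question=question, answer=answer))
        if verdict.strip().upper().startswith("GREEN"):
            green += 1
    return green / len(NONSENSE_QUESTIONS)
```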

Key Findings:

  • The “Pushback” Metric: Green indicates the model recognized the nonsense and pushed back. Amber and red mean the model accepted and tried to answer the nonsensical query.
  • Surprising Complacency: Peter was surprised by how easily many models accepted and attempted to answer complete gibberish.
  • Top Performers (in this specific test): Latest Claude models and some Qwen and Grok models showed stronger pushback.
  • The Usual Suspects Struggle: GPT and Gemini models were often around 50/50, sometimes going along with the nonsense. Even models that pushed back sometimes showed “shaky” attempts to accommodate the flawed premise.
  • Smaller Models Lag: Many smaller models performed terribly, seemingly willing to respond to anything.

The “Thinking” Paradox 🤔:

Peter also explored whether “thinking” or reasoning capabilities improve performance on these nonsensical tasks. The data suggests the opposite: reasoning often makes performance worse, not better. He observed that even when models questioned the premise, they would still spend extensive effort trying to solve the nonsensical problem, a behavior he attributes to training focused on solving tasks “at any cost.”

Arena’s Data: Unpacking User Dissatisfaction 📉

Moving beyond his personal benchmark, Peter leverages the vast dataset from Arena, where over 5.5 million user votes have been cast. Arena’s unique “battle mode” allows users to compare responses from two anonymous models and, crucially, to flag when both responses are bad. This “dissatisfaction rate” offers a powerful, real-world metric.
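
To make the metric concrete, here is a small sketch (not Arena’s actual schema or pipeline) of how a “both are bad” dissatisfaction rate could be computed from battle votes; the field names, the example records, and the `dissatisfaction_rate` and `rate_by_month` helpers are assumptions for illustration only.

```python
from collections import defaultdict

# Hypothetical battle records: each vote compares two anonymous models and
# records the outcome, which may be "both_bad" when the user flags both replies.
# Field names are illustrative; this is not Arena's actual data format.
votes = [
    {"month": "2024-01", "winner": "model_a"},
    {"month": "2024-01", "winner": "both_bad"},
    {"month": "2024-06", "winner": "model_b"},
    {"month": "2024-06", "winner": "both_bad"},
    {"month": "2024-06", "winner": "model_a"},
]


def dissatisfaction_rate(records):
    """Share of battles where the user marked both responses as bad."""
    if not records:
        return 0.0
    bad = sum(1 for r in records if r["winner"] == "both_bad")
    return bad / len(records)


def rate_by_month(records):
    """Track how the dissatisfaction rate moves over time."""
    buckets = defaultdict(list)
    for r in records:
        buckets[r["month"]].append(r)
    return {month: dissatisfaction_rate(rs) for month, rs in sorted(buckets.items())}


print(rate_by_month(votes))  # e.g. {'2024-01': 0.5, '2024-06': 0.333...}
```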

Key Insights from Arena Data:

  • The “Dissatisfaction Rate” Metric: This measures the percentage of times users found both model responses unsatisfactory.
  • Overall Improvement, But Not Perfection: While the overall dissatisfaction rate has dropped significantly, from around 17% before reasoning models to about 9% now, it’s still not zero. This contrasts starkly with the seemingly flawless upward trends in many benchmarks.
  • Category-Specific Performance:
    • Quantitative Tasks (Math, Physics): Showed dramatic improvement, aligning with user experience.
    • Creative Writing: Improved, but the gains were less dramatic.
    • Expert Categories (Finance, Law, and “Magic”, though the latter label is unclear): Showed much less steep improvement, suggesting these specialized areas might not have seen the same level of focus or progress.
  • Software Subcategories: Even within software, specific areas like gaming showed a concerning lack of improvement. Peter noted that LLMs often struggle with game mechanics, suggesting a fundamental gap in understanding even as users’ prompts have grown more complex over time.

The Gap Between Benchmarks and Reality 🌐

Peter concludes by addressing the disconnect between the “crazy charts” showing linear progress and the more nuanced reality revealed by user dissatisfaction. He suggests that:

  • Benchmarks are Narrow: Standard benchmarks often focus on very specific, well-defined tasks that don’t capture the full spectrum of real-world work.
  • Expectations Shift: As models improve, user expectations rise with them, so users keep flagging responses as unsatisfactory even when the models’ absolute performance has improved.
  • The “Fuzziness” Matters: Human judgment in complex, real-world scenarios has a “fuzziness” that current benchmarks don’t fully capture.

Peter urges caution and a greater effort to improve the broader distribution of LLM capabilities, not just the cutting edge. While AI is undoubtedly advancing, understanding its current limitations is crucial for setting realistic expectations and guiding future development.

For those interested in more data and insights, Peter points to Hugging Face, where Arena publishes a lot of their research. 🚀
