A new benchmark from researchers in Switzerland and Germany shows that even top models like Claude Opus 4.5 with web search enabled still produce incorrect information in nearly a third of all cases. The article New benchmark shows AI models still hallucinate far too often appeared first on The Decoder. [...]
AI agents are now embedded in real enterprise workflows, and they're still failing roughly one in three attempts on structured benchmarks. That gap between capability and reliability is the defining [...]
There's no shortage of generative AI benchmarks designed to measure a given model's performance and accuracy on various helpful enterprise tasks — from coding to instruction following [...]
A growing number of developers and AI power users are taking to social media to accuse Anthropic of degrading the performance of Claude Opus 4.6 and Claude Code — intentionally or as an outcome of c [...]
Elon Musk's frontier generative AI startup xAI formally opened developer access to its Grok 4.1 Fast models last night and introduced a new Agent Tools API, but the technical milestones were immediately [...]
Zoom Video Communications, the company best known for keeping remote workers connected during the pandemic, announced last week that it had achieved the highest score ever recorded on one of artificial [...]
When large language models hallucinate, they leave measurable traces in their own computations. Researchers at the Sapienza University of Rome have developed a training-free method that picks up on th [...]