Destination
New benchmark shows AI models still hallucinate far too often

A new benchmark from researchers in Switzerland and Germany shows that even top models like Claude Opus 4.5 with web search enabled still produce incorrect information in nearly a third of all cases. The article "New benchmark shows AI models still hallucinate far too often" appeared first on The Decoder. [...]

We found items similar to what you are looking for. Check out our suggestions for similar AI tools.

venturebeat
Frontier models are failing one in three production attempts — and getting harder to audit

AI agents are now embedded in real enterprise workflows, and they're still failing roughly one in three attempts on structured benchmarks. That gap between capability and reliability is the defin [...]

Match Score: 73.38

blogspot
How I Get Free Traffic from ChatGPT in 2025 (AIO vs SEO)

Three weeks ago, I tested something that completely changed how I think about organic traffic. I opened ChatGPT and asked a simple question: "What's the best course on building SaaS with Wor [...]

Match Score: 70.50

venturebeat
The 70% factuality ceiling: why Google’s new ‘FACTS’ benchmark is a wake-up call for enterprise AI

There's no shortage of generative AI benchmarks designed to measure the performance and accuracy of a given model on completing various helpful enterprise tasks — from coding to instruction fol [...]

Match Score: 56.48

venturebeat
Is Anthropic 'nerfing' Claude? Users increasingly report performance degradation as leaders push back

A growing number of developers and AI power users are taking to social media to accuse Anthropic of degrading the performance of Claude Opus 4.6 and Claude Code — intentionally or as an outcome of c [...]

Match Score: 54.16

venturebeat
Grok 4.1 Fast's compelling dev access and Agent Tools API overshadowed by Musk glazing

Elon Musk's frontier generative AI startup xAI formally opened developer access to its Grok 4.1 Fast models last night and introduced a new Agent Tools API—but the technical milestones were imm [...]

Match Score: 52.92

venturebeat
Zoom says it aced AI’s hardest exam. Critics say it copied off its neighbors.

Zoom Video Communications, the company best known for keeping remote workers connected during the pandemic, announced last week that it had achieved the highest score ever recorded on one of artificia [...]

Match Score: 49.94

Destination
The best microSD cards in 2025

Most microSD cards are fast enough for boosting storage space and making simple file transfers, but some provide a little more value than others. If you’ve got a device that still accepts microSD ca [...]

Match Score: 49.55

Destination
When language models hallucinate, they leave "spilled energy" in their own math

When large language models hallucinate, they leave measurable traces in their own computations. Researchers at the Sapienza University of Rome have developed a training-free method that picks up on th [...]

Match Score: 48.03

Destination
Confident user prompts make LLMs more likely to hallucinate

Many language models are more likely to generate incorrect information when users request concise answers, according to a new benchmark study. The article Confident user prompts make LLMs [...]

Match Score: 45.63