Destination

2025-11-19

Gemini 3 Pro tops new AI reliability benchmark, but hallucination rates remain high

Stylized illustration of a hollow book in front of a square grid and curved wire frame


A new benchmark from Artificial Analysis reveals alarming weaknesses in the factual reliability of large language models. Out of 40 models tested, only four achieved a positive score - with Google's Gemini 3 Pro clearly in the lead.


The article Gemini 3 Pro tops new AI reliability benchmark, but hal [...]

Rating

Innovation

Pricing

Technology

Usability

We have discovered similar tools to what you are looking for. Check out our suggestions for similar AI tools.

venturebeat

2025-11-18

Google unveils Gemini 3 claiming the lead in math, science, multimodal and agentic AI benchmarks

After more than a month of rumors and feverish speculation — including Polymarket wagering on the release date — Google today unveiled Gemini 3, its newest proprietary frontier model family and th [...]

Match Score: 142.69

venturebeat

2025-11-20

Grok 4.1 Fast's compelling dev access and Agent Tools API overshadowed by Musk glazing

Elon Musk's frontier generative AI startup xAI formally opened developer access to its Grok 4.1 Fast models last night and introduced a new Agent Tools API—but the technical milestones were imm [...]

Match Score: 115.04

venturebeat

2025-11-18

Musk's xAI launches Grok 4.1 with lower hallucination rate on the web and apps — no API access (for now)

In what appeared to be a bid to soak up some of Google's limelight prior to the launch of its new Gemini 3 flagship AI model — now recorded as the most powerful LLM in the world by multiple ind [...]

Match Score: 97.18

Destination

2025-11-18

Google's new Gemini 3 model arrives in AI Mode and the Gemini app

A few weeks short of Gemini 2's first birthday, Google has announced Gemini 3 Pro. Naturally, the company claims the new system is its most intelligent AI model yet, offering state-of-the-art rea [...]

Match Score: 76.65

venturebeat

2025-11-20

Google's upgraded Nano Banana Pro AI image model hailed as 'absolutely bonkers' for enterprises and users

Infographics rendered without a single spelling error. Complex diagrams one-shotted from paragraph prompts. Logos restored from fragments. And visual outputs so sharp with so much text density and acc [...]

Match Score: 68.79

venturebeat

2025-10-07

Google's AI can now surf the web for you, click on buttons, and fill out forms with Gemini 2.5 Computer Use

Some of the largest providers of large language models (LLMs) have sought to move beyond multimodal chatbots — extending their models out into "agents" that can actually take more actions [...]

Match Score: 64.91

venturebeat

2025-11-13

Upwork study shows AI agents excel with human partners but fail independently

Artificial intelligence agents powered by the world's most advanced language models routinely fail to complete even straightforward professional tasks on their own, according to groundbreaking re [...]

Match Score: 64.04

venturebeat

2025-11-23

Lean4: How the theorem prover works and why it's the new competitive edge in AI

Large language models (LLMs) have astounded the world with their capabilities, yet they remain plagued by unpredictability and hallucinations – confidently outputting incorrect information. In high- [...]

Match Score: 60.08

venturebeat

2025-11-07

Terminal-Bench 2.0 launches alongside Harbor, a new framework for testing agents in containers

The developers of Terminal-Bench, a benchmark suite for evaluating the performance of autonomous AI agents on real-world terminal-based tasks, have released version 2.0 alongside Harbor, a new framewo [...]

Match Score: 59.70