Peektastic.com

venturebeat

Gemini 3 Pro scores 69% trust in blinded testing up from 16% for Gemini 2.5: The case for evaluating AI on real-world trust, not academic benchmarks

Just a few short weeks ago, Google debuted its Gemini 3 model, claiming it scored a leadership position in multiple AI benchmarks. But the challenge with vendor-provided benchmarks is that they are just that — vendor-provided. A new vendor-neutral evaluation from Prolific, however, puts Gemini 3 at the top of the leaderboard. This isn't on a set of academic benchmarks; rather, it's on a set of real-world attributes that actual users and organizations care about. Prolific was founded by researchers at the University of Oxford. The company delivers high-quality, reliable human data to power rigorous research and ethical AI development. The company's “HUMAINE benchmark” applies this approach by using representative human sampling and blind testing to rigorously compare A [...]

Discover Copy

Rating

Innovation

Pricing

Technology

Usability

We have discovered similar tools to what you are looking for. Check out our suggestions for similar AI tools.

venturebeat

AI IQ is here: a new site scores frontier AI models on the human IQ scale. The results are already dividing tech.

For decades, the IQ test has been one of the most familiar — and most contested — yardsticks for human intelligence. Now, a startup project called AI IQ is applying the same metaphor to artificial [...]

More Copy

Match Score: 109.93

venturebeat

Artificial Analysis overhauls its AI Intelligence Index, replacing popular benchmarks with 'real-world' tests

The arms race to build smarter AI models has a measurement problem: the tests used to rank them are becoming obsolete almost as quickly as the models improve. On Monday, Artificial Analysis, an indepe [...]

More Copy

Match Score: 74.31

venturebeat

Why Weibo’s tiny VibeThinker-3B has the AI world arguing over benchmarks again

On Sunday, a team of nine researchers at Sina Weibo — the Chinese social media giant better known for its microblogging platform than for cutting-edge artificial intelligence — quietly posted a 14 [...]

More Copy

Match Score: 70.32

venturebeat

Google's Gemini Flash 5.6 model cuts AI agent token costs by up to 65% on long horizon engineering tasks —and 3.5 Pro is on the way

Google DeepMind today released three new proprietary AI models it says are among its most token-efficient yet: Gemini 3.6 Flash, Gemini 3.5 Flash-Lite, and Gemini 3.5 Flash Cyber. The models aim to ma [...]

More Copy

Match Score: 66.56

venturebeat

Google's Gemini 3.6 Flash model cuts AI agent token costs by up to 65% on long horizon engineering tasks —and 3.5 Pro is on the way

More Copy

Match Score: 66.56

venturebeat

DeepSWE blows up the AI coding leaderboard, crowns GPT-5.5, and finds Claude Opus exploiting a benchmark loophole

For months, the leading AI coding benchmarks have told enterprise buyers a comforting but misleading story: the top models are all roughly the same. OpenAI's GPT-5 family, Anthropic's Claude [...]

More Copy

Match Score: 60.63

venturebeat

Frontier models are failing one in three production attempts — and getting harder to audit

AI agents are now embedded in real enterprise workflows, and they're still failing roughly one in three attempts on structured benchmarks. That gap between capability and reliability is the defin [...]

More Copy

Match Score: 57.04

venturebeat

Google unveils Gemini 3 claiming the lead in math, science, multimodal and agentic AI benchmarks

After more than a month of rumors and feverish speculation — including Polymarket wagering on the release date — Google today unveiled Gemini 3, its newest proprietary frontier model family and th [...]

More Copy

Match Score: 55.68

venturebeat

Google’s new AI agent can draft your emails, monitor your inbox and eventually spend your money

Google on Tuesday unveiled Gemini Spark, a personal AI agent designed to work around the clock — drafting emails, assembling documents, monitoring inboxes, and eventually making purchases — even w [...]

More Copy

Match Score: 53.27