Destination

2025-12-26

New benchmark shows LLMs still can't do real scientific research


Getting top marks on exams doesn't automatically make you a good researcher. A new study shows this academic truism applies to large language models too.


The article "New benchmark shows LLMs still can't do real scientific research" appeared first on The Decoder.

[...]

We have discovered similar tools to what you are looking for. Check out our suggestions for similar AI tools.

venturebeat

2025-10-16

Amazon and Chobani adopt Strella's AI interviews for customer research as fast-growing startup raises $14M

One year after emerging from stealth, Strella has raised $14 million in Series A funding to expand its AI-powered customer research platform, the company announced Thursday. The round, led by Bessemer [...]

Match Score: 75.80

venturebeat

2025-12-10

The 70% factuality ceiling: why Google’s new ‘FACTS’ benchmark is a wake-up call for enterprise AI

There's no shortage of generative AI benchmarks designed to measure the performance and accuracy of a given model on completing various helpful enterprise tasks — from coding to instruction fol [...]

Match Score: 70.95

venturebeat

2025-11-03

Meet Denario, the AI ‘research assistant’ that is already getting its own papers published

An international team of researchers has released an artificial intelligence system capable of autonomously conducting scientific research across multiple disciplines — generating papers from initia [...]

Match Score: 66.15

venturebeat

2025-12-16

Zoom says it aced AI’s hardest exam. Critics say it copied off its neighbors.

Zoom Video Communications, the company best known for keeping remote workers connected during the pandemic, announced last week that it had achieved the highest score ever recorded on one of artificia [...]

Match Score: 65.93

venturebeat

2025-11-25

What enterprises should know about The White House's new AI 'Manhattan Project' the Genesis Mission

President Donald Trump’s new “Genesis Mission” unveiled Monday is billed as a generational leap in how the United States does science akin to the Manhattan Project that created the atomic bomb d [...]

Match Score: 58.70

venturebeat

2025-11-13

Upwork study shows AI agents excel with human partners but fail independently

Artificial intelligence agents powered by the world's most advanced language models routinely fail to complete even straightforward professional tasks on their own, according to groundbreaking re [...]

Match Score: 54.86

venturebeat

2025-12-01

OpenAGI emerges from stealth with an AI agent that it claims crushes OpenAI and Anthropic

A stealth artificial intelligence startup founded by an MIT researcher emerged this morning with an ambitious claim: its new AI model can control computers better than systems built by OpenAI and Anth [...]

Match Score: 54.73

venturebeat

2025-12-09

Databricks' OfficeQA uncovers disconnect: AI agents ace abstract tests but stall at 45% on enterprise docs

There is no shortage of AI benchmarks in the market today, with popular options like Humanity's Last Exam (HLE), ARC-AGI-2 and GDPval, among numerous others. AI agents excel at solving abstract ma [...]

Match Score: 54.23

venturebeat

2025-11-20

Grok 4.1 Fast's compelling dev access and Agent Tools API overshadowed by Musk glazing

Elon Musk's frontier generative AI startup xAI formally opened developer access to its Grok 4.1 Fast models last night and introduced a new Agent Tools API—but the technical milestones were imm [...]

Match Score: 54.16