2025-07-20
ARC-AGI-3 aims to test how well AI systems can handle brand new problems. While people breeze through the challenges, the latest AI models still come up short.
The article "New ARC-AGI-3 benchmark shows that humans still outperform LLMs at pretty basic thinking" appeared first on THE DECODER.
2025-08-07
In the ARC-AGI-2 benchmark, which is designed to measure a language model's general reasoning skills, GPT-5 (High) scored 9.9 percent at a cost of $0.73 per task, according to ARC Prize. [...]
2025-03-26
The new AI benchmark ARC-AGI-2 significantly raises the bar for AI tests. While humans can easily solve the tasks, even highly developed AI systems such as OpenAI o3 clearly fail. [...]
2025-05-13
OpenAI has released a new benchmark for testing AI systems in healthcare. Called HealthBench, it's designed to evaluate how well language models handle realistic medical conversations. According [...]
2025-05-27
The Browser Company has stopped active development of the popular Arc web browser, according to a blog post from CEO Josh Miller. There will still be updates to fix security issues and the like, but t [...]
2025-10-01
Thinking Machines, the AI startup founded earlier this year by former OpenAI CTO Mira Murati, has launched its first product: Tinker, a Python-based API designed to make large language model (LLM) fin [...]
2025-10-02
IBM today announced the release of Granite 4.0, the newest generation of its homemade family of open source large language models (LLMs) designed to balance high performance with lower memory and cost [...]