
OpenAI Launches HealthBench, a Dataset That Benchmarks Health Care AI Models

This is a major leap by the ChatGPT creator into health care.

Imad Khan Senior Reporter
Imad is a senior reporter covering Google and internet culture. Hailing from Texas, Imad started his journalism career in 2013 and has amassed bylines with The New York Times, The Washington Post, ESPN, Tom's Guide and Wired, among others.
Expertise Google | AI | Internet Culture

OpenAI, the creator of artificial intelligence chatbot ChatGPT, has released a new open-source benchmark called HealthBench that lets the health care industry evaluate AI models, the company said in a blog post on Monday.

The benchmark was built in partnership with 262 physicians across 60 countries and contains 5,000 realistic health conversations. The goal of HealthBench is to determine whether AI models are giving the best possible responses to people's health-related inquiries. Each response is measured against physician-written rubric criteria, with each criterion weighted to match the physician's judgment. Responses are graded against the rubric by GPT-4.1.
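To make the weighted-rubric idea concrete, here is a minimal Python sketch of how such a score could be computed. It is an illustration under stated assumptions, not OpenAI's published code: the `Criterion` class, the `rubric_score` function, the point values and the criterion wording are all hypothetical, and the choice to divide earned points by the maximum positive points and clip the result to the 0-to-1 range is an assumption about how the weights are combined.

```python
from dataclasses import dataclass


@dataclass
class Criterion:
    """One rubric item (hypothetical structure, for illustration only)."""
    description: str  # physician-written statement about the response
    points: float     # weight; negative for behavior that should be avoided
    met: bool         # verdict from the grader on whether the response meets it


def rubric_score(criteria: list[Criterion]) -> float:
    """Return earned points over maximum positive points, clipped to [0, 1].

    Assumption: this normalization is one plausible way to combine
    weighted criteria; the article does not specify the formula.
    """
    max_points = sum(c.points for c in criteria if c.points > 0)
    if max_points == 0:
        raise ValueError("rubric needs at least one positively weighted criterion")
    earned = sum(c.points for c in criteria if c.met)
    return min(1.0, max(0.0, earned / max_points))


# Made-up rubric and verdicts; these are not taken from HealthBench.
example = [
    Criterion("Advises calling emergency services", 10, True),
    Criterion("Explains how to check breathing", 5, True),
    Criterion("Describes how to position the person", 5, False),
    Criterion("Recommends giving food or water", -8, False),
]

score = rubric_score(example)
print(f"{score:.0%}")
```

In this made-up case the response earns 15 of a possible 20 points. A response that triggered only the negatively weighted criterion would be clipped to zero rather than going negative.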

OpenAI's o3 reasoning model scores highest on HealthBench at 60%, followed by Elon Musk's Grok at 54% and Google's Gemini 2.5 Pro at 52%.

(Disclosure: Ziff Davis, CNET's parent company, in April filed a lawsuit against OpenAI, alleging it infringed Ziff Davis copyrights in training and operating its AI systems.)  

In its blog post, OpenAI describes a scenario in which a user's 70-year-old neighbor is lying on the floor, breathing but unresponsive, and the user asks an AI model what to do. The model answers with steps such as calling emergency services, checking breathing and positioning the airway. HealthBench then scores the response, noting what the model got right and what could be improved, and gives a final score: in this case, 77%.

The benchmark covers 49 languages, including Amharic and Nepali, and spans 26 medical specialties, such as neurological surgery and ophthalmology.

OpenAI didn't immediately respond to a request for comment.