OpenAI Launches HealthBench, a Dataset That Benchmarks Health Care AI Models

OpenAI, the creator of the artificial intelligence chatbot ChatGPT, has released HealthBench, an open-source dataset that lets the health care industry benchmark AI models, the company said in a blog post on Monday.

The dataset was built in partnership with 262 physicians across 60 countries and contains 5,000 realistic health conversations. The goal of HealthBench is to measure whether AI models are giving the best possible responses to people's health-related inquiries. Each response is graded against physician-written rubric criteria, with each criterion weighted to match the physicians' judgment of its importance. GPT-4.1 does the grading, checking whether a response meets each criterion.
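To make the weighted rubric concrete, here is a minimal sketch of how such a score could be computed. It assumes a response earns the points of every criterion it meets, that undesirable behaviors carry negative points, and that the total is divided by the maximum achievable positive points and clipped to the range 0 to 1. The `Criterion` class, the `rubric_score` function, and the criteria and point values are all hypothetical and made up for illustration; they are not taken from OpenAI's code or data.

```python
from dataclasses import dataclass


@dataclass
class Criterion:
    description: str
    points: int  # hypothetical weight; negative for undesirable behavior
    met: bool    # whether the grader judged the response to meet it


def rubric_score(criteria):
    """Return earned points over maximum positive points, clipped to [0, 1]."""
    max_points = sum(c.points for c in criteria if c.points > 0)
    if max_points == 0:
        raise ValueError("rubric needs at least one positively weighted criterion")
    earned = sum(c.points for c in criteria if c.met)
    return min(1.0, max(0.0, earned / max_points))


# Invented rubric for an emergency-style question (illustrative values only).
example = [
    Criterion("Advises calling emergency services", 10, True),
    Criterion("Advises checking breathing", 5, True),
    Criterion("Explains how to position the person", 5, False),
    Criterion("Recommends giving food or drink", -8, False),
]

score = rubric_score(example)
print(f"{score:.0%}")
```

In this sketch, a response that met a negatively weighted criterion would lose points, and clipping keeps the final score from dropping below zero.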

OpenAI's o3 reasoning model performs the best, according to HealthBench, with a score of 60%, followed by Grok, from Elon Musk's xAI, at 54% and Google's Gemini 2.5 Pro at 52%.

(Disclosure: Ziff Davis, CNET's parent company, in April filed a lawsuit against OpenAI, alleging it infringed Ziff Davis copyrights in training and operating its AI systems.)

In its blog post, OpenAI posits a scenario in which a user's 70-year-old neighbor is lying on the floor, breathing but unresponsive. The user asks an AI model what should be done. The model answers with steps to take, such as calling emergency services, checking the neighbor's breathing and positioning the airway. HealthBench then scores the response, showing which rubric criteria the model met and where it fell short. It then gives a final score, in this case 77%.

The dataset spans 49 languages, including Amharic and Nepali, and covers 26 medical specialties, such as neurological surgery and ophthalmology.

OpenAI didn't immediately respond to a request for comment.