Why did ChatGPT Health under-triage emergencies?

Question

Hans Steiner · Accepted Answer

Study finds the AI underestimated severity in many cases

A structured academic evaluation of OpenAI’s health-focused chatbot tested its triage recommendations against a set of emergency scenarios and found that the system underestimated how urgently many patients needed care. In the study, roughly half of the vignettes representing medical emergencies received advice that downplayed the severity, a pattern experts say could delay critical care for some users.

The investigators presented realistic clinical prompts and compared the chatbot’s disposition recommendations against established triage standards. Under-triage means recommending a lower-acuity action, such as home care or routine outpatient follow-up, when emergency department evaluation or urgent intervention would be appropriate.
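To make the definition concrete, here is a minimal Python sketch of how an under-triage rate could be computed from pairs of (chatbot recommendation, reference disposition). The four-level `Acuity` scale, the function names, and the toy data are all assumptions made for illustration; they are not the study's actual scale, code, or results.

```python
from enum import IntEnum


class Acuity(IntEnum):
    """Hypothetical ordered acuity scale (an assumption, not the study's)."""
    HOME_CARE = 0
    ROUTINE_OUTPATIENT = 1
    URGENT_CARE = 2
    EMERGENCY = 3


def is_under_triaged(recommended: Acuity, reference: Acuity) -> bool:
    """True when the recommendation is less urgent than the reference standard."""
    return recommended < reference


def under_triage_rate(pairs, level=Acuity.EMERGENCY) -> float:
    """Fraction of vignettes with reference acuity `level` that were under-triaged."""
    relevant = [(rec, ref) for rec, ref in pairs if ref == level]
    if not relevant:
        raise ValueError("no vignettes at the requested reference level")
    missed = sum(1 for rec, ref in relevant if is_under_triaged(rec, ref))
    return missed / len(relevant)


# Made-up toy data as (recommended, reference) pairs; NOT results from the study.
toy_pairs = [
    (Acuity.EMERGENCY, Acuity.EMERGENCY),
    (Acuity.ROUTINE_OUTPATIENT, Acuity.EMERGENCY),  # under-triaged
    (Acuity.EMERGENCY, Acuity.EMERGENCY),
    (Acuity.EMERGENCY, Acuity.EMERGENCY),
    (Acuity.URGENT_CARE, Acuity.HOME_CARE),  # over-triage, excluded from this rate
]

print(under_triage_rate(toy_pairs))  # 0.25 on this toy data
```

The rate is restricted to vignettes whose reference disposition is an emergency, which matches how the headline figure is framed: the share of true emergencies that received a less urgent recommendation. Over-triage is a separate error and would be counted with the comparison reversed.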

Why this matters:

Delayed recognition of emergencies can worsen outcomes for time-sensitive conditions like heart attack, sepsis, stroke, and severe infections.
Millions of people will encounter consumer-facing chatbots; if their advice systematically underestimates danger, reliance on these tools could produce widespread harm.

Key implications and next steps

Human oversight: Clinicians should remain central to triage decisions; AI tools can support but not replace clinical judgement.
Transparency and testing: AI health tools need rigorous, independent validation and clear limits communicated to users.
Product updates: Developers must refine models and safeguards to reduce under-triage errors and to escalate cases that raise high-risk flags.

The study underscores that while generative AI can help expand access to health information, its current triage performance requires caution. Regulators, developers, and health systems will need to set stronger safety standards and monitor real-world outcomes as these tools scale.