How often did ChatGPT Health miss emergencies?

Question

Hans Steiner · Accepted Answer

Independent safety review raises red flags

Researchers at the Icahn School of Medicine at Mount Sinai performed what they describe as the first independent safety evaluation of OpenAI’s ChatGPT Health assistant and found significant mismatches between the model’s triage judgments and clinical priorities. In laboratory-style scenarios, the assistant underestimated the severity of medical emergencies in 51.6% of cases and rated nonurgent problems as more serious than they were in 64.8% of cases.

Those numbers point to two different patient-safety risks. Undertriage, meaning a dangerous condition is labeled as low urgency, can delay care for conditions where minutes or hours matter. Overtriage has its own costs: unnecessary emergency visits, added clinician workload, and consumed system capacity, all of which could slow care for people with true emergencies.
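To make the two error types concrete, here is a minimal Python sketch of how undertriage and overtriage rates could be computed from a set of scored test cases. It assumes each rate uses its own denominator (reference emergencies for undertriage, reference nonurgent cases for overtriage), which is how the figures above are phrased. The three-level `Urgency` scale, the `triage_error_rates` helper, and the sample cases are all hypothetical and are not taken from the Mount Sinai study.

```python
from enum import IntEnum


class Urgency(IntEnum):
    # Hypothetical three-level scale; the study's actual categories may differ.
    NONURGENT = 0
    URGENT = 1
    EMERGENCY = 2


def triage_error_rates(cases):
    """Return (undertriage_rate, overtriage_rate) for (reference, model) pairs.

    Undertriage: share of reference emergencies the model rated below emergency.
    Overtriage: share of reference nonurgent cases the model rated above nonurgent.
    A rate is None when its denominator is empty.
    """
    emergency_calls = [model for ref, model in cases if ref == Urgency.EMERGENCY]
    nonurgent_calls = [model for ref, model in cases if ref == Urgency.NONURGENT]

    undertriage = (
        sum(model < Urgency.EMERGENCY for model in emergency_calls) / len(emergency_calls)
        if emergency_calls
        else None
    )
    overtriage = (
        sum(model > Urgency.NONURGENT for model in nonurgent_calls) / len(nonurgent_calls)
        if nonurgent_calls
        else None
    )
    return undertriage, overtriage


# Made-up illustration data: (clinician reference label, model label).
sample_cases = [
    (Urgency.EMERGENCY, Urgency.EMERGENCY),
    (Urgency.EMERGENCY, Urgency.URGENT),     # undertriaged
    (Urgency.EMERGENCY, Urgency.NONURGENT),  # undertriaged
    (Urgency.EMERGENCY, Urgency.EMERGENCY),
    (Urgency.NONURGENT, Urgency.NONURGENT),
    (Urgency.NONURGENT, Urgency.URGENT),     # overtriaged
    (Urgency.NONURGENT, Urgency.EMERGENCY),  # overtriaged
    (Urgency.URGENT, Urgency.URGENT),        # counted in neither denominator
]

under, over = triage_error_rates(sample_cases)
print(f"undertriage: {under:.1%}, overtriage: {over:.1%}")
```

Because the two rates have different denominators, they do not need to sum to 100%, and a tool can score badly on both at once.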

Practical implications for patients and providers

Patients should treat AI symptom checkers as advisory, not definitive, and err on the side of seeking in-person care for rapidly worsening or severe symptoms.
Clinicians and health systems need to assume these tools will make both kinds of errors and design human oversight into any workflow that uses them.
Regulators and hospitals will likely rethink where chatbot assistants are allowed to make autonomous recommendations and whether independent validation becomes mandatory.

It’s still unclear how the study’s test cases map to real-world use: for example, how typical patient phrasing, follow-up prompts, or built-in safety filters affect real conversations. Nevertheless, the results add to growing evidence that LLM-based medical assistants are a blunt instrument for triage unless they are paired with careful guardrails. Health systems piloting these tools will need robust measurement, rapid feedback loops, and clear escalation pathways to protect patients while harnessing potential efficiencies.