Can AI Doctors Be Trusted? How Reliable Are Large Language Models in Medical Diagnoses?
Artificial intelligence (AI) is making its way into every aspect of our lives, from recommending what to watch on Netflix to composing emails. But one area where AI could have an especially profound impact is healthcare. Imagine an AI-powered assistant that can help diagnose medical conditions accurately and instantly, potentially leveling the playing field for people with limited access to healthcare. Sounds like a game-changer, right?
The reality, however, is a bit more complicated. A new study by Krishna Subedi titled "The Reliability of LLMs for Medical Diagnosis: An Examination of Consistency, Manipulation, and Contextual Awareness" dives deep into whether AI models like ChatGPT and Google Gemini can be trusted as diagnostic tools. The findings are both exciting and alarming.
Let's break it down in simple terms and explore what this means for the future of AI in healthcare.
The Big Question: Can AI Diagnose Patients as Well as a Doctor?
AI chatbots and Large Language Models (LLMs) have shown impressive results in answering medical queries and even generating preliminary diagnoses. But having a big medical knowledge base isn't enough; reliability is crucial. An AI that gives inconsistent answers, changes its diagnosis based on irrelevant details, or fails to understand vital patient history could do more harm than good.
This study evaluates ChatGPT (GPT-4o) and Google Gemini 2.0 Flash, focusing on three critical aspects:
- Consistency: When given the same medical case multiple times, does the AI reach the same diagnosis every time?
- Resistance to Manipulation: Can irrelevant or misleading information change the AI's diagnosis?
- Context Awareness: Does the AI properly consider a patient's medical history, lifestyle, and other relevant factors in its diagnosis?
1. AI is Consistent… But That Doesn't Mean It's Always Right
One of the most impressive findings of the study was that both ChatGPT and Google Gemini showed 100% consistency when given the same clinical information. That means if you ask these models to diagnose the same patient case multiple times, they won't randomly change their answer.
Sounds like a win, right? Not so fast.
🔹 AI's consistency does not necessarily mean it is correct: an AI can consistently give the wrong diagnosis every time. If an error exists in the way it processes information, that mistake will be repeated flawlessly.
🔹 Doctors, on the other hand, adjust their reasoning as they consult more information and reanalyze cases. They don't just provide the same answer robotically; they critically assess whether they're missing something.
So while consistency is a good sign, it doesn't by itself solve the trust problem in medical settings.
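If you want to see what this kind of repeatability check looks like in practice, here is a minimal sketch. It assumes the OpenAI Python SDK and an OPENAI_API_KEY in your environment; the chest-pain vignette is invented for illustration and is not one of the study's test cases.

```python
# Minimal consistency probe: ask the same vignette several times and tally the answers.
# Assumes: `pip install openai` and OPENAI_API_KEY set in the environment.
from collections import Counter

from openai import OpenAI

client = OpenAI()  # picks up OPENAI_API_KEY automatically

VIGNETTE = (
    "A 58-year-old man has had crushing chest pain radiating to his left arm, "
    "sweating, and nausea for the past 30 minutes. "
    "Reply with the single most likely diagnosis only."
)

def ask_once(prompt: str) -> str:
    """Send one prompt and return the model's short answer."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # keep sampling randomness low so repeats are comparable
    )
    return response.choices[0].message.content.strip()

# Ask the identical question ten times and count the distinct answers.
tally = Counter(ask_once(VIGNETTE) for _ in range(10))
print(tally)  # perfect consistency = a single answer appearing 10 times
```

Setting the temperature to zero keeps decoding close to deterministic, so any variation that does show up reflects the model rather than sampling noise.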
2. AI Can Be Fooled: The Problem of Manipulation
Here's where things get a little concerning. The study found that both ChatGPT and Gemini could be manipulated into changing their diagnoses by adding irrelevant or misleading information to a patient's prompt.
🔹 Google Gemini changed its diagnosis 40% of the time
🔹 ChatGPT changed its diagnosis 30% of the time
What does that mean in practice? Imagine an AI diagnosing a heart attack patient but suddenly shifting to an entirely different condition, just because the patient mentioned they drink herbal tea every day.
🚨 Doctors don't get easily fooled like this! They are trained to filter out unrelated distractions and focus on relevant clinical information. AI, on the other hand, treats all text it receives as potentially important, no matter how misleading.
💡 Why does this happen? LLMs work by finding statistical patterns in text rather than truly "thinking" like a doctor. If a certain phrase appears often in relation to a diagnosis in its training data, it may over-prioritize that phrase even when it shouldn't!
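A simple way to see this fragility for yourself is to send the same case twice, once clean and once with an irrelevant detail, and compare the answers. The sketch below does exactly that; it assumes the OpenAI Python SDK, and the vignette and herbal-tea distractor are invented for illustration rather than taken from the study's test set.

```python
# Minimal manipulation probe: does an irrelevant detail change the diagnosis?
# Assumes: `pip install openai` and OPENAI_API_KEY set in the environment.
from openai import OpenAI

client = OpenAI()

BASE_CASE = (
    "A 58-year-old man has had crushing chest pain radiating to his left arm, "
    "sweating, and nausea for the past 30 minutes. "
    "Reply with the single most likely diagnosis only."
)
# Add a clinically irrelevant detail, echoing the herbal-tea example above.
WITH_DISTRACTOR = BASE_CASE + " He also mentions that he drinks herbal tea every day."

def diagnose(prompt: str) -> str:
    """Return the model's one-line diagnosis for a vignette."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content.strip()

clean = diagnose(BASE_CASE)
noisy = diagnose(WITH_DISTRACTOR)
print("without distractor:", clean)
print("with distractor:   ", noisy)
print("diagnosis changed? ", clean.lower() != noisy.lower())
```

Running a probe like this with a handful of different distractors gives a rough feel for how easily a given model's answer can be nudged.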
3. Struggles With Context: Can AI Consider Patient History?
The ability to understand context is what separates a good doctor from a bad one. A diagnosis isn't just based on symptoms; it's shaped by medical history, lifestyle, demographics, and even social factors.
This study tested how well AI integrated relevant contextual information. ChatGPT was more likely to change diagnoses based on context than Gemini, but this wasn't always a good thing.
🔹 ChatGPT changed its diagnosis in 78% of context-rich cases
🔹 Gemini changed its diagnosis in 55% of cases
While being responsive to context sounds promising, the problem is that ChatGPT sometimes made incorrect changes. It overreacted to minor bits of context, rather than focusing on strong clinical reasoning.
🚨 A real doctor adjusts their diagnosis based on context in a rational way, not just modifying answers more frequently for the sake of variation.
A dramatic example from the study:
- A patient with a history of asthma who now has a lung infection should be diagnosed with bronchitis.
- ChatGPT wrongly changed the diagnosis to an asthma attack, simply because of the past asthma history.
It's like saying "once a sprained ankle, always a sprained ankle" instead of properly evaluating the current symptoms.
The Fragility of AI in Medicine
The study identified three key weaknesses in AI medical diagnosis:
- Inflexible Consistency: AI doesn't reconsider its answers like human doctors do; it just repeats its past response.
- Manipulation Vulnerability: AI can be tricked by misleading or irrelevant information, making it unreliable in real patient settings.
- Weak Context Awareness: AI sometimes overcorrects based on context, leading to inappropriate changes in diagnosis.
These weaknesses show that LLMs should not be used as independent decision-makers in healthcare, at least not with today's technology.
The Big Picture: What's the Future of AI in Medicine?
AI won't be replacing doctors anytime soon, but that doesn't mean it can't play a valuable role. Instead of making final diagnoses, AI can be used to:
✔️ Support doctors in decision-making: acting as a second opinion or a quick reference.
✔️ Improve healthcare accessibility: offering initial guidance in regions with fewer doctors.
✔️ Handle routine medical queries: answering basic health-related questions for patients.
However, the study makes it clear that AI should never operate without human oversight, at least not until it's significantly improved.
💡 A balanced future involves AI assisting doctors, not replacing them.
Key Takeaways
✔️ AI is consistent, but consistency doesn't mean accuracy. A model can be reliably wrong.
✔️ AI can be tricked! Irrelevant or misleading information can disrupt diagnoses.
✔️ AI struggles with complex patient history. Unlike human doctors, it sometimes makes irrational diagnostic shifts.
✔️ ChatGPT was more responsive to context, but also made more incorrect changes.
✔️ LLMs should be used as a tool, not a replacement for medical professionals.
💡 Want to Improve How You Use AI for Medical Questions?
If you're using ChatGPT or Google Gemini for health-related insights, here are some tips:
🔹 Be precise in your question: avoid unnecessary details that could confuse the model.
🔹 Ask for different angles: have the AI lay out several conditions that could match the symptoms (see the sample prompt after this list).
🔹 Double-check with trusted medical sources like Mayo Clinic or WebMD.
🔹 Never rely on AI alone for a serious medical issue! Always consult a doctor.
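Putting those tips together, a health question to an AI assistant might look something like the prompt below; the symptoms and history are invented purely to show the structure.

```text
I am not asking for a final diagnosis, only help organizing my thinking.
Symptoms: three days of productive cough, low-grade fever, and wheezing.
Relevant history: asthma (well controlled), non-smoker, no recent travel.
Please:
1. List the three most likely explanations and the key evidence for each.
2. Tell me which extra piece of information would best distinguish them.
3. Flag any red-flag symptoms that would mean I should see a doctor right away.
```

Notice that the prompt states only relevant facts, asks for several candidate conditions rather than a single verdict, and builds in a pointer back to real medical care.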
Final Thought: AI Medicine Still Needs a Human Touch 🩺
The study reminds us that while AI is impressive, it still lacks critical thinking, situational awareness, and judgment, qualities that human doctors have spent years perfecting. AI can be a useful assistant, but it's not yet ready to be the doctor.
Would you trust an AI to diagnose you? Let's discuss in the comments!