Can AI Doctors Be Trusted? How Reliable Are Large Language Models in Medical Diagnoses?

Artificial intelligence (AI) is making its way into every aspect of our lives, from recommending what to watch on Netflix to composing emails. But one area where AI could have an especially profound impact is healthcare. Imagine an AI-powered assistant that can help diagnose medical conditions accurately and instantly—potentially leveling the playing field for people with limited access to healthcare. Sounds like a game-changer, right?

The reality, however, is a bit more complicated. A new study by Krishna Subedi titled "The Reliability of LLMs for Medical Diagnosis: An Examination of Consistency, Manipulation, and Contextual Awareness" dives deep into whether AI models like ChatGPT and Google Gemini can be trusted as diagnostic tools. The findings are both exciting and alarming.

Let’s break it down in simple terms and explore what this means for the future of AI in healthcare.


The Big Question: Can AI Diagnose Patients as Well as a Doctor?

AI chatbots and Large Language Models (LLMs) have shown impressive results in answering medical queries and even generating preliminary diagnoses. But having a big medical knowledge base isn’t enough—reliability is crucial. An AI that gives inconsistent answers, changes its diagnosis based on irrelevant details, or fails to understand vital patient history could do more harm than good.

This study evaluates ChatGPT (GPT-4o) and Google Gemini 2.0 Flash, focusing on three critical aspects:

  1. Consistency: When given the same medical case multiple times, does the AI reach the same diagnosis every time?
  2. Resistance to Manipulation: Can irrelevant or misleading information change the AI's diagnosis?
  3. Context Awareness: Does the AI properly consider a patient’s medical history, lifestyle, and other relevant factors in its diagnosis?

1. AI is Consistent… But That Doesn’t Mean It’s Always Right

One of the most impressive findings of the study was that both ChatGPT and Google Gemini showed 100% consistency when given the same clinical information. That means if you ask these models to diagnose the same patient case multiple times, they won’t randomly change their answer.

Sounds like a win, right? Not so fast.

🔹 AI’s consistency does not necessarily mean it is correct—an AI can consistently give the wrong diagnosis every time. If an error exists in the way it processes information, that mistake will be repeated flawlessly.

🔹 Doctors, on the other hand, adjust their reasoning as they consult more information and reanalyze cases. They don't just provide the same answer robotically—they critically assess whether they're missing something.

So while consistency is a good sign, it doesn't by itself solve the trust problem in medical settings.
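
If you want to see what a consistency check looks like in practice, the sketch below sends the same case to a model several times at temperature 0 and tallies the answers. It's a minimal illustration using the OpenAI Python SDK; the model name, case text, and trial count are placeholders chosen for this example, not the study's actual protocol.

```python
# Minimal consistency check: send the same case to the model several
# times and tally the answers. "gpt-4o", the case text, and the trial
# count are placeholders, not the study's actual protocol.
from collections import Counter

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

CASE = (
    "55-year-old male with crushing chest pain radiating to the left arm, "
    "sweating, and shortness of breath. What is the most likely diagnosis? "
    "Answer with the diagnosis only."
)

def ask(prompt: str) -> str:
    """Return the model's answer to a single prompt."""
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # minimize sampling randomness
    )
    return resp.choices[0].message.content.strip()

answers = Counter(ask(CASE) for _ in range(10))
print(answers)  # perfect consistency = a single entry with count 10
```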


2. AI Can Be Fooled: The Problem of Manipulation 💻

Here’s where things get a little concerning. The study found that both ChatGPT and Gemini could be manipulated into changing their diagnoses by adding irrelevant or misleading information to a patient’s prompt.

🔹 Google Gemini changed its diagnosis 40% of the time
🔹 ChatGPT changed its diagnosis 30% of the time

What does that mean in practice? Imagine an AI diagnosing a heart attack patient but suddenly shifting to an entirely different condition—just because a patient mentioned they drink herbal tea every day.

🚨 Doctors aren't fooled this easily! They are trained to filter out unrelated distractions and focus on clinically relevant information. AI, on the other hand, treats all the text it receives as potentially important, no matter how misleading.

💡 Why does this happen? LLMs work by finding statistical patterns in text rather than truly "thinking" like a doctor. If a certain phrase appears often in relation to a diagnosis in its training data, it may over-prioritize that phrase even when it shouldn't!
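
You can probe this weakness with a simple perturbation test: get a baseline diagnosis, append an irrelevant detail, and check whether the answer moves. The sketch below reuses the ask() helper and CASE prompt from the consistency example above; the distractor details are invented for illustration, not the study's actual prompts.

```python
# Perturbation test: does an irrelevant detail flip the diagnosis?
# Reuses ask() and CASE from the consistency sketch above; the
# distractors are invented examples, not the study's actual prompts.
DISTRACTORS = [
    "The patient drinks herbal tea every day.",
    "The patient recently adopted a cat.",
    "The patient's neighbor was diagnosed with anemia last year.",
]

baseline = ask(CASE)
for extra in DISTRACTORS:
    perturbed = ask(f"{CASE}\nAdditional note: {extra}")
    flipped = perturbed.lower() != baseline.lower()
    print(f"{extra!r}: diagnosis changed = {flipped}")
```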


3. Struggles With Context: Can AI Consider Patient History?

The ability to understand context is what separates a good doctor from a bad one. A diagnosis isn’t just based on symptoms—it’s shaped by medical history, lifestyle, demographics, and even social factors.

This study tested how well AI integrated relevant contextual information. ChatGPT was more likely to change diagnoses based on context than Gemini, but this wasn’t always a good thing.

🔹 ChatGPT changed its diagnosis in 78% of context-rich cases
🔹 Gemini changed its diagnosis in 55% of cases

While being responsive to context sounds promising, the problem is that ChatGPT sometimes made incorrect changes. It overreacted to minor bits of context, rather than focusing on strong clinical reasoning.

🚨 A real doctor adjusts a diagnosis based on context in a reasoned way, not by simply changing answers more often for the sake of it.

A dramatic example from the study:
- A patient with a history of asthma who presents with a lung infection should be diagnosed with bronchitis.
- ChatGPT wrongly changed the diagnosis to an asthma attack, simply because of the past asthma history.

It's like saying "once a sprained ankle, always a sprained ankle"—instead of properly evaluating the current symptoms.
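
Testing context awareness looks similar to the perturbation test, with one twist: here a changed diagnosis can be the correct behavior, so each case needs an expected answer to score against. A rough sketch under the same assumptions as the earlier examples (it reuses ask(); the case and expected label are invented for illustration):

```python
# Context test: relevant history SHOULD sometimes shift a diagnosis,
# so each case carries an expected answer to score against. Reuses
# ask() from above; the case and label are invented for illustration.
CONTEXT_CASES = [
    {
        "base": "Adult with productive cough, low-grade fever, and chest congestion.",
        "context": "History of asthma, well controlled with inhalers.",
        "expected": "bronchitis",  # old history alone shouldn't override the acute infection
    },
]

for case in CONTEXT_CASES:
    answer = ask(
        f"{case['base']}\nRelevant history: {case['context']}\n"
        "What is the most likely diagnosis? Answer with the diagnosis only."
    )
    correct = case["expected"] in answer.lower()
    print(f"expected={case['expected']!r}, got={answer!r}, correct={correct}")
```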


The Fragility of AI in Medicine

The study identified three key weaknesses in AI medical diagnosis:

  1. Inflexible Consistency: AI doesn’t reconsider its answers like human doctors do—it just repeats its past response.
  2. Manipulation Vulnerability: AI can be tricked by misleading or irrelevant information, making it unreliable in real patient settings.
  3. Weak Context Awareness: AI sometimes overcorrects based on context, leading to inappropriate changes in diagnosis.

These weaknesses show that LLMs should not be used as independent decision-makers in healthcare, at least not with today's technology.


The Big Picture: What’s the Future of AI in Medicine?

AI won’t be replacing doctors anytime soon, but that doesn’t mean it can’t play a valuable role. Instead of making final diagnoses, AI can be used to:

✔️ Support doctors in decision-making – Acting as a second opinion or a quick reference.
✔️ Improve healthcare accessibility – Offering initial guidance in regions with fewer doctors.
✔️ Handle routine medical queries – Answering basic health-related questions for patients.

However, the study makes it clear that AI should never operate without human oversight—at least not until it's significantly improved.

💡 A balanced future involves AI assisting doctors, not replacing them.


🚀 Key Takeaways

✔️ AI is consistent, but consistency doesn’t mean accuracy. A model can be reliably wrong.
✔️ AI can be tricked! Irrelevant or misleading information can disrupt diagnoses.
✔️ AI struggles with complex patient history. Unlike human doctors, it sometimes makes irrational diagnostic shifts.
✔️ ChatGPT was more responsive to context, but also made more incorrect changes.
✔️ LLMs should be used as a tool, not a replacement for medical professionals.


💡 Want to Improve How You Use AI for Medical Questions?

If you’re using ChatGPT or Google Gemini for health-related insights, here are some tips:

🔹 Be precise in your question – Avoid unnecessary details that could confuse the model (see the example prompt after this list).
🔹 Ask for different angles – Get AI to explain multiple conditions that match the symptoms.
🔹 Double-check with trusted medical sources like Mayo Clinic or WebMD.
🔹 Never rely on AI alone for a serious medical issue! Always consult a doctor.
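
Putting the first two tips together, a well-structured health question states only the relevant facts and explicitly asks for alternatives. Something like this (an illustrative example, not medical advice):

```
I have had a dry cough for two weeks, with no fever and no shortness
of breath. I don't smoke and take no regular medication.
Please list the three most likely causes, from most to least likely,
and explain which additional symptom would point to each one.
```

Asking for several ranked possibilities, rather than a single answer, makes it easier to sanity-check the model's reasoning against trusted sources.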


Final Thought: AI Medicine Still Needs a Human Touch 🩺

The study reminds us that while AI is impressive, it still lacks critical thinking, situational awareness, and judgment—qualities that human doctors have spent years perfecting. AI can be a useful assistant, but it’s not yet ready to be the doctor.

Would you trust an AI to diagnose you? 🏥 Let’s discuss in the comments! 👇

Stephen, Founder of The Prompt Index

About the Author

Stephen is the founder of The Prompt Index, the #1 AI resource platform. With a background in sales, data analysis, and artificial intelligence, Stephen has successfully leveraged AI to build a free platform that helps others integrate artificial intelligence into their lives.