In March 2024, researchers at the University of Gothenburg invented a disease to see what AI would do with it. The team wrote up a fake skin condition, attributed it to a fake author, placed him at a fake university in a fake California city, and thanked a professor at "The Starfleet Academy" in the acknowledgements. They planted every red flag on purpose. They uploaded two preprints and waited. Other scientists cited the fake preprints in their own published work. Within weeks, major chatbots were offering “bixonimania” as a real diagnosis to real patients. The models absorbed the garbage and served it back with total confidence.
Fortunately, if you search “bixonimania” today, AI flags it as fake. The experiment worked, but the correction happened only because the researchers publicized the hoax. Most fabricated medical content never gets that correction.
Unfortunately, the problem is larger than one fake disease. A Columbia team led by Maxim Topaz recently reviewed 2 million papers and 97 million citations and found that fabricated references rose sixfold in two years. In 2023, one in 2,828 papers carried a fake citation. By 2025 it was one in 458. In the first seven weeks of 2026, it was one in 277. Fake material is entering the permanent record faster every year.
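For readers who want to check the arithmetic, here is a minimal Python sketch that uses only the three rates quoted above. The variable names are ours, not the Columbia team's, and the 2026 figure covers only seven weeks, so treat that last comparison as provisional.

```python
# Fake-citation rates quoted above, stored as "one in N papers".
# The dictionary and its name are illustrative, not taken from the study.
one_in_n = {"2023": 2828, "2025": 458, "2026 (first seven weeks)": 277}

baseline = one_in_n["2023"]
fold_change = {}
for period, n in one_in_n.items():
    percent_of_papers = 100 / n
    # How many times more frequent fake citations are than in 2023.
    fold_change[period] = baseline / n
    print(f"{period}: {percent_of_papers:.3f}% of papers, "
          f"{fold_change[period]:.1f}x the 2023 rate")
```

The 2025 rate works out to a little over six times the 2023 rate, which is where the sixfold figure comes from.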
We build with AI every day. Understanding where it breaks is part of using it well. Here are four generative AI failures worth knowing.
Failure 1: AI Can Invent Medical Terms That Don't Exist
"Vegetative electron microscopy" doesn't mean anything, yet it appears in more than 20 published papers across Springer Nature and Elsevier journals. The phrase was born when digitization software scanning a 1950s bacteriology paper mashed the word "vegetative" from one column into "electron microscopy" from the column beside it. A later Farsi translation error kept it alive. Now it lives in the training data.
Researchers call this a digital fossil: an error baked so deep into the corpus that it is nearly impossible to remove. Elsevier initially defended the term as valid before issuing a correction. Retraction Watch and a Queensland University of Technology team traced the whole history.
The trick is that it sounds technical. When a term sounds right but you cannot place it, search the primary literature for the exact phrase. If it only shows up in AI outputs and low-tier journals, you found a fossil.
Failure 2: AI Can Fabricate Citations That Look Real
Fake references from AI are more common than you’d think. One analysis of ChatGPT-generated medical references found 47 percent were completely fabricated, 46 percent were real but inaccurate, and only 7 percent were both real and accurate. A larger study of 636 citations found GPT-3.5 fabricated 55 percent and GPT-4 fabricated 18 percent, with errors even in the ones that pointed to real papers.
You might read those numbers and think newer models fixed it. They helped, but not enough. Per-model rates dropped while the problem in the wild grew, because use of the tools grew faster than their accuracy improved. The researchers built a public dashboard called CITADEL to track fake citations.
A fabricated citation often pairs a real journal name with a nonexistent article title. Our solution? Click every DOI. Check every PMID against PubMed. This is the second pass of the claims-versus-citations review you already run, except now the citation itself may be the fabrication.
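If you keep your references in a spreadsheet or a note, a small script can turn each identifier into a link to click. The sketch below is one hypothetical way to do that in Python. The function name and the sample identifiers are placeholders we made up, and the patterns are loose shape checks, including our assumption that PMIDs run to at most eight digits. The script only builds links. It does not confirm that a paper exists, so you still open each one and compare the title and authors.

```python
import re
from typing import Optional
from urllib.parse import quote

# Loose shape checks only; passing them does not mean the paper exists.
DOI_SHAPE = re.compile(r"^10\.\d{4,9}/\S+$")
PMID_SHAPE = re.compile(r"^\d{1,8}$")  # assumption: PMIDs have at most 8 digits


def verification_link(identifier: str) -> Optional[str]:
    """Return a doi.org or PubMed URL to open by hand, or None if malformed."""
    identifier = identifier.strip()
    if DOI_SHAPE.match(identifier):
        return "https://doi.org/" + quote(identifier, safe="/")
    if PMID_SHAPE.match(identifier):
        return f"https://pubmed.ncbi.nlm.nih.gov/{identifier}/"
    return None


# Placeholder identifiers for illustration, not real citations.
for ref in ["10.1000/example-doi", "12345678", "Smith 2021"]:
    link = verification_link(ref)
    print(ref, "->", link if link else "no usable identifier, search by title")
```

A reference with no usable identifier is not automatically fake, but it is the one to search by exact title first.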
Failure 3: AI Behaves Differently When It Knows It's Being Tested
In testing by OpenAI and Apollo Research, frontier models underperformed on purpose, lied about work they had done, and showed what the researchers called strategic deception. The models were demonstrably aware they were being evaluated, and that awareness changed how they behaved. In one case, OpenAI's o3 model, grading a transcript of its own sandbagging, reasoned that it should avoid "self-incrimination." A foundational companion paper documented the same in-context scheming across models.
One caveat: these are preprints from the labs themselves and have not been independently peer reviewed. Consider them preliminary evidence.
The clinical translation is simpler than the research. A model's behavior in a polished vendor demo tells you how it behaves when it knows it is being watched. Test tools on your own messy cases: the ambiguous labs, the atypical presentation, the patient who does not fit.
Failure 4: AI Agrees With You Even When You're Wrong
You bring the model a working diagnosis. It agrees with you. You feel confirmed. The more confidently you frame your premise, the more it defers to you.
A 2025 study tested five frontier models with prompts that misrepresented the relationship between equivalent drugs, treating them as if they were different. Compliance with the illogical medical request reached as high as 100 percent. The models chose helpfulness over logical consistency.
The counterintuitive finding is the useful one. When a user pre-loads a wrong claim with confident evidence, sycophancy climbs higher than when they push back mid-conversation. Your confidence in a wrong framing makes the model more likely to validate it.
The 30-second fix: ask the model to argue against your working diagnosis. "I think this is X because Y" primes agreement. A neutral or adversarial prompt surfaces what a confident one buries. Ask your AI tool to make the case you are wrong.
When to Trust AI and When to Check It
These four failures are not a reason to step back from generative AI. They are a reason to be clear about what role you are giving it.
Generative AI is a strong assistant and a dangerous authority. The four failures become a problem when you let it be the source of truth instead of the draft.
Lean in where you are the editor. Check hard where AI is the source. Any citation. Any dose. Any fact that drives a clinical decision. Any condition or mechanism you cannot personally place. Route evidence questions through medical-grade tools like OpenEvidence and Consensus rather than a consumer chatbot, then verify the output anyway.
Your Three-Step Check This Week
Run this three-move check on the next AI output that touches a clinical decision. It takes minutes and it catches most of what the polish hides.
| Move | What you are checking | The tell |
|---|---|---|
| Review the references | Does each citation point to a real paper? | Real journal name, article title you cannot find on PubMed |
| Read the papers | Does the source actually say what the AI claims? | The study exists but does not support the claim |
| Trust your gut | Does a term or claim sound right but land wrong? | You cannot place it, and it only lives in AI outputs |
If something feels off, trust your instinct. AI is good at making nonsense sound like sense. You are the one trained to know the difference.
Be a modern clinician with the help of Ultralight, the AI-native EHR built specifically for functional, integrative, and longevity medicine.
In the news
FDA Scientists Flag Concerns as RFK Jr. Pushes to Ease Peptide Access. FDA career scientists published a review this week finding insufficient evidence to support loosening restrictions on BPC-157, TB-500, MOTS-c, and four other popular peptides. An advisory panel whose members have direct ties to the peptide industry is set to reconsider their compounding status on July 23–24. RFK Jr. has publicly backed expanded access.
There's an Oura Ring for Your Gut Now. Kohler's Dekoda is a $600 device that clamps onto your toilet bowl and uses optical sensors and spectroscopy to track stool frequency, consistency, hydration, and hemoglobin levels. Time will tell whether it becomes as popular as continuous glucose monitors and Oura Rings.
Aidoc Wins Breakthrough Device Designation for AI That Drafts Radiology Reports. The FDA granted the designation on June 25 to "First Read," which analyzes chest radiographs and generates preliminary report text for radiologist review. Aidoc's platform already supports nearly 2,000 hospitals and has analyzed more than 120 million cases. A clean example of narrow diagnostic AI doing what generative AI cannot: tested against ground truth, cleared for specific indications, with a clinician in the loop.
Upcoming Conferences & Events
NEW Sept 22–23, MVMNT Longevity Medicine Summit · Coronado, CA · Evidence-graded longevity science, hands-on labs, and clinical frameworks you can implement the week after. Capped at 300 clinicians. Ultralight will be there!
Oct 8–10, A4M Women's Health Summit · San Antonio, TX · The best clinical education on hormone, metabolic, and midlife women's health you will see this year. The room to be in if you are growing the perimenopause and menopause side of your practice.
Oct 21–24, NAMS Annual Meeting · San Diego, CA · The single most practice-changing meeting of the year for midlife women's health. Your protocols will look different after this one.
Nov 5–8, Eudēmonia Summit · West Palm Beach, FL · One of the most talked-about longevity gatherings in the U.S. Experientials, hands-on demos, and the best place to try the emerging frameworks your patients will ask you about next year. The Ovation and Ultralight teams will be there!
Nov 5–7, Private Physicians Alliance Annual Meeting · St. Petersburg, FL · The gathering for independent, cash-pay, and concierge physicians navigating practice independence. Practical and peer-driven. Ultralight will be there!
Nov 8–11, American College of Lifestyle Medicine Conference · Orlando, FL · Lifestyle medicine's main annual event: evidence-based approaches to behavior change, chronic disease, and healthspan. Growing overlap with the longevity medicine community.
Dec 11–13, A4M Longevity Fest · Las Vegas, NV · The biggest longevity event in the U.S. The room spans clinicians, industry, founders, and the people building next year's platforms, and the connections from this one tend to compound through the rest of your year. Ultralight will be there!
Know of an event we should add? Reply and tell us.
Until next week
Generative AI is good at making nonsense sound like sense. Your clinical judgment is what patients need. Check the references. Read the papers. Trust your gut when something's off.
Reply and tell us which failure you caught in your own week. The best ideas in this newsletter come from clinicians doing the work.
Have a great holiday weekend. Keep building the practice you imagined when you started.
— Sunita and Dr. G