LLMs hallucinate when removing patient info from EPR, finds study
- 18 December 2025
- A study found that AI tools sometimes produce hallucinations when asked to remove personal patient information from EPRs
- Researchers evaluated the ability of LLMs to detect and remove patient data from real-world records, without altering clinical content
- Smaller LLMs frequently over-redacted or produced erroneous text not present in the original record
AI tools sometimes produce hallucinations when asked to remove personal patient information from electronic patient records (EPRs), a study has found.
Researchers from the University of Oxford evaluated the ability of large language models (LLMs) and purpose-built software tools to detect and remove patient names, dates, medical record numbers, and other identifiers from real-world records, without altering clinical content.
The study, published in iScience on 9 December 2025, found that smaller LLMs frequently over-redacted or produced hallucinatory content, generating erroneous text that was not present in the original record and occasionally introducing fabricated medical details.
“Hallucinations, particularly those that fabricate clinical information, pose a non-trivial risk to the integrity of downstream research.
“We suggest future research focusing on systematic, scalable techniques to detect and suppress hallucinations, especially in zero- and few-shot scenarios,” the study says.
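As a rough illustration of what detecting this kind of hallucination could involve, the Python sketch below flags words in a de-identified record that cannot be matched, in order, to the original text. It is not the method used in the study. The function names and the `[NAME]`-style placeholder format are assumptions made for this example, and a word-level check like this would miss paraphrased or reordered content.

```python
import re

# Assumed placeholder format for redacted spans, e.g. [NAME] or [DATE].
PLACEHOLDER = re.compile(r"\[[A-Z_]+\]")


def tokens(text):
    """Split text into lowercase word tokens."""
    return re.findall(r"\w+", text.lower())


def added_tokens(original, redacted):
    """Return words in the redacted text that cannot be matched,
    in order, to words in the original record."""
    source = tokens(original)
    # Drop placeholders so they are not counted as new text.
    output = tokens(PLACEHOLDER.sub(" ", redacted))
    added = []
    pos = 0
    for tok in output:
        try:
            # Find the next occurrence at or after the current position.
            pos = source.index(tok, pos) + 1
        except ValueError:
            added.append(tok)
    return added


original = "John Smith seen on 3 May with chest pain."
faithful = "[NAME] seen on [DATE] with chest pain."
hallucinated = "[NAME] seen on [DATE] with chest pain and fever."

print(added_tokens(original, faithful))
print(added_tokens(original, hallucinated))
```

In this made-up example the faithful redaction yields no added words, while the second output is flagged for "and" and "fever", neither of which appears in the original record.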
Firstly, the researchers tested the ability of a human to anonymise the data by manually redacting 3,650 medical records, comparing and correcting the data until they had a complete set to use as a benchmark.
They then compared two task-specific de-identification software tools (Microsoft Azure and AnonCAT) and five general-purpose LLMs (GPT-4, GPT-3.5, Llama-3, Phi-3 and Gemma) on their ability to redact identifiable information.
Dr Andrew Soltan, academic clinical lecturer in oncology at the University of Oxford and engineering research fellow, said: “While some large language models perform impressively, others can generate false or misleading text.
“This behaviour poses a risk in clinical contexts, and careful validation is critical before deployment.”
The researchers concluded that automating de-identification could significantly reduce the time and cost required to prepare clinical data for research, while maintaining patient privacy in compliance with data protection regulations.
Microsoft’s Azure de-identification service achieved the highest performance overall, closely matching human reviewers. GPT-4 also performed strongly, demonstrating that modern language models can accurately remove identifiers with minimal fine-tuning or task-specific training.
Dr Soltan added: “One of our most promising findings was that we don’t need to retrain complex AI models from scratch.
“We found that some models worked well out-of-the-box, and that others saw their performance nudged upwards with simple techniques.
“For the general-purpose models, this meant showing them just a handful of examples of what a correctly anonymised record looks like.
“For the specialised software, one model learned to pick up nuances in our hospital’s data, like the format of telephone extensions, after fine-tuning on just a small sample.
“This is exciting because it shows a practical path for hospitals to adopt these technologies without manually labelling thousands of patient notes.”
Professor David Eyre, professor of infectious diseases at Oxford Population Health and the Big Data Institute, said: “This work shows that AI can be a powerful ally in protecting patient confidentiality.
“But human judgement and strong governance must remain at the centre of any system that handles patient data.”
The study was supported by the National Institute for Health and Care Research (NIHR), Microsoft Research UK, Cancer Research UK, the EPSRC, and the NIHR Oxford Biomedical Research Centre.
