A clinician-dominated audience was pitted against the best of large language model intelligence Tuesday afternoon at Digital Health Rewired24, as day one concluded on the AI, Data and Analytics stage with a Live AI bake-off between ChatGPT and human doctors. 

Haris Shuaib, CEO of Newton’s Tree and head of CSC at Guy’s and St Thomas’ NHS Foundation Trust, chaired the session, with a panel including Lia Ali, clinical advisor at NHS England’s Transformation Directorate; Keith Grimes, a digital health and innovation consultant; Annabelle Painter, an NHS doctor and clinical AI fellow at the Artificial Intelligence Centre for Value-Based Healthcare; and Michael Watts, associate CCIO at University Hospitals of Derby and Burton NHS Foundation Trust. 

The panellists presented several scenarios to the audience. In one, a patient complained of a sore back, burning pain in their foot and recent “tingling in my privates”. While clinicians in the audience and onstage said they would need to ask more questions, most suspected a condition involving nerve compression. 

When fed the same scenario, the AI model prefaced its response with an acknowledgment that it was not a doctor – a response that Shuaib described as “an improvement” from previous iterations of the technology – before also suggesting possible nerve damage and saying the condition might require immediate medical attention. 

Audience members were asked to vote by a show of hands on whether they would prefer doctors or AI across different domains of healthcare quality: safety, effectiveness, timeliness, patient-centredness, efficiency and equitable impact. The clinicians won comprehensively in every category except timeliness, where, to little surprise, they were beaten badly, and patient-centredness and empathy, where the vote was closer. 

AI answers limited in surprising and predictable ways 

“For most of the scenarios, it was very difficult for us to get ChatGPT to give a bad answer,” Shuaib said. Two areas stood out, he noted, one of which was calculating a dose of medicine, because it involved mathematical reasoning.  

“In the textual explanation, it explains its working out like a child doing a maths problem, but when it does the actual calculation it does it wrong, which is interesting,” he said. “So it has this explanation and veneer of respectability and correctness, but if you don’t look closely, it’s actually got the maths wrong. And it took us ten minutes to work out why it was wrong, so it made extra work for us to double check its answer.” 

Beyond the potential surprise of a computer with deficient quantitative skills, Shuaib said, the example demonstrated that when paired with an expert user, AI “makes the expert user slower, and it makes the non-expert user over-confident, so it’s a double-edged sword.” 

A more complicated scenario trialled by the panel involved a patient experiencing domestic violence: her sister reported that the patient had appeared with bruises, which she blamed on clumsiness, and was suffering from depression. 

“We had to really manufacture that story, that narrative, because when we tried to be more subtle [the AI model] just wouldn’t get it,” Shuaib said. “It would suggest things like vitamin deficiencies and actual clumsiness, but it wouldn’t ask about any family issues or anything like that until we made it so bleeding obvious.” 

Once the model figured out what the issue was, he said, it gave a “delicate” response, advising that the clinician keep an eye on things, have a conversation without judgment and recognise that the patient might take time to open up.