Experts have warned of “health data poverty” if new technologies like artificial intelligence are based on unrepresentative datasets.
Researchers from INSIGHT, the Health Data Research Hub for eye health, and University Hospitals Birmingham NHS Foundation Trust said technology has the potential to reinforce healthcare inequalities if it is not informed by representative data.
As artificial intelligence (AI) gains traction in healthcare, many academics and commercial organisations have used publicly available datasets to develop AI tools and other digital health solutions.
But little is known about how many datasets actually exist, or about the diversity of people and health conditions represented within them, researchers said. This could lead to the development of technologies and products that only work for certain groups or countries, they added.
Focusing on eye health, consultant eye specialist and director of INSIGHT, Professor Alastair Denniston, and his colleagues carried out a global search to establish which datasets are publicly available and the extent to which they represent the diversity and needs of the world’s population.
They identified and analysed 94 datasets containing 507,724 clinical images and 125 videos of eyes gathered from at least 122,364 people.
They then created a comprehensive catalogue detailing the source of each dataset, its accessibility, and the populations, diseases and types of images represented within it.
They found most images came from populations in Asia and Europe, with very few datasets from large parts of the world such as sub-Saharan Africa (one dataset) and South America (two datasets).
They also discovered that information about the people within each dataset was generally poor, with basic demographic information such as age, sex and ethnicity missing in more than one in five datasets.
Professor Denniston said: “We hope that our catalogue will raise awareness of more diverse datasets for the development of AI-based health technologies.
“We need to act now to encourage health systems and researchers to invest in publicly available datasets to support research and innovation in areas that are currently data poor.
“Otherwise, we risk perpetuating a growing digital divide where healthcare technologies are only developed to benefit diseases, populations and countries with advanced data infrastructure.”
The lack of geographical diversity could lead to technologies being developed that work well for one population but not for another, they warned.
There were also disparities in the types of eye disease depicted.
Most images identified in the research were relevant to diseases including diabetic retinopathy, glaucoma and age-related macular degeneration. Researchers noted this is because these images are routinely collected as part of healthcare and screening in countries with advanced modern health infrastructure.
But data for cataracts, trachoma and refractive error – which have been designated as priority eye diseases by the World Health Organization and account for half of all global blindness – were significantly under-represented.
These conditions are common in low- and middle-income countries where digital technology could make a big difference in enabling access to healthcare, researchers said.
The “lack of relevant data for developing and training AI-based tools makes it less likely that researchers and companies will be able to develop products that could help”, they said in a statement.
Caroline Cake, chief executive of Health Data Research UK, added: “The coming generation of digital health technologies are only as good as the data we use to develop them, and this new study highlights the fact that datasets must be representative and inclusive if these tools are to be relevant and applicable to all.
“We are committed to working with our national and international partners to ensure that advances in digital healthcare bring benefits to everyone.”
The research was published in The Lancet Digital Health.