Lecture Introduction and Theme Overview
Jack Van Horn from the University of Virginia introduced the 2023-2024 seminar series focused on "Generative AI for Healthcare." This lecture series, supported by several UVA institutions and the National Institute of General Medical Sciences, aims to explore the intersection of data science methodologies, artificial intelligence, and biomedical sciences.
The theme for the current year emphasizes building partnerships for generative AI training in biomedical and clinical research. Speakers in this series will share insights about the challenges, opportunities, and innovative applications of AI in the healthcare sector.
"Today we are excited to welcome our speaker Dr. Roxanna Danu, who will discuss the potential applications of generative AI in healthcare."
Speaker Background
Dr. Roxana Daneshjou, a specialist in biomedical data science and dermatology at Stanford University, has a strong educational background, including a degree in bioengineering from Rice University and an MD/PhD from Stanford with a PhD focus in genetics.
In addition to her academic accomplishments, Dr. Daneshjou has clinical experience that allows her to bridge the gap between theoretical AI applications and practical medical situations. She leads a research lab focused on fair and transparent AI in healthcare.
"My clinical experience colors everything that I do within my research."
AI Challenges in Healthcare
Dr. Daneshjou opened with a scenario about patient care delays in the current American healthcare system, illustrating the frustration of trying to access timely dermatology care after discovering a concerning skin lesion.
She highlighted systemic issues such as long wait times and accessibility, which affect both patients and healthcare professionals. The prevailing sentiment is that the healthcare system is broken and in need of improvement, which raises hopes for AI as a potential solution.
"The healthcare system is broken in many, many ways."
The Role of Generative AI in Streamlining Healthcare
The talk emphasized AI not as a replacement for healthcare professionals but as a tool to aid them, potentially streamlining processes and improving diagnostic capabilities. Dr. Daneshjou noted a tension in the community between visions of AI replacing doctors and AI assisting them.
While generative AI has gained popularity, it is still far from being capable of independently practicing medicine. However, AI can enhance non-specialists' ability to provide care, potentially alleviating some burdens on specialists.
"AI should be seen as aiding physicians rather than replacing them."
Advances in Generative AI and Healthcare
The transition from earlier GPT models, like GPT-2 and GPT-3, to GPT-3.5 and GPT-4, marks a monumental leap in capabilities. This advancement has led to significant improvements in generative AI with potential impacts on healthcare.
The speaker recalls her excitement upon discovering GPT-3.5 during a conference, where it was able to answer medical questions that earlier models struggled with.
The rapid pace of integrating these generative AI models into healthcare is surprising, especially given the traditionally slow adoption of new technologies in the sector.
Companies like Epic and Microsoft have already collaborated to incorporate GPT-4 into electronic health records, while others, like Google, are testing their own models in hospital systems.
Despite the swift developments, there is concern about the lack of thorough clinical trials and evaluation frameworks to assess the efficacy of these AI models in healthcare. This raises numerous unanswered research questions regarding their effectiveness.
"It's shocking to me how quickly things have moved, given that healthcare generally moves very slowly when it comes to technology adoption."
Computer Vision in Healthcare
The section highlights various AI tools developed for computer vision in healthcare, such as models that predict pneumonia from chest X-rays or assess skin lesions for malignancy.
A significant issue with many of these models is that they often operate as 'black boxes,' making it difficult to understand the clinically relevant features they utilize for decision-making.
An illustrative example involves skin cancer detection models that sometimes rely on spurious correlations, such as recognizing a purple marker used during biopsies, rather than identifying clinically significant features.
The development of explainable AI is crucial to ensure human users grasp what influences an algorithm's decisions, enhancing trust and usability in clinical settings.
"You want to ensure that spurious features, like a purple marker, are not what's being used to make the assessment."
Methodology for Evaluating AI Models in Dermatology
An innovative approach was taken to evaluate the reasoning processes of AI models used in dermatology by creating counterfactual images. This method was adapted from techniques used in facial recognition tasks.
The team utilized a generative AI model to modify reference images of skin lesions in realistic ways to see how these alterations affected the predictions of an existing model, "Deep Derm."
By producing variations of images that appear more benign or malignant, researchers can assess which features influence the AI's decision-making from a clinically relevant standpoint.
This analysis has uncovered both reassuring and concerning factors behind the models' determinations, providing insights that can improve the interpretability of AI outputs in clinical practice.
"What we can do is... have experts look at these pairs of images and identify what are the differences between those images in clinically relevant terms."
Model Evaluation and Insights
The counterfactual methodology was applied to multiple models to understand which image features influence predictions of lesions as benign or malignant. Background elements, such as hair in images, are clinically irrelevant features for a model to focus on, and reliance on them can lead to inaccuracies.
By assessing the characteristics used by various models, researchers can gain insights into whether they rely on clinically relevant features. Discrepancies in model performance might indicate a need for revisiting training datasets to enhance accuracy and effectiveness.
"The methodology of generative AI gives us new insights into how models are working and whether they're using appropriate features."
Use of Synthetic Data in Dermatology
The use of synthetic data in training models is significant due to the protected nature of healthcare data. Research indicates that a majority of AI and dermatology studies rely on proprietary data sets that are not publicly accessible, which creates challenges in developing effective AI models due to a lack of diverse training examples.
Researchers explored synthetic images as a method of augmentation, particularly to address bias in models that may not perform well on images of diverse skin tones. While synthetic images can help improve performance, they must be used carefully to avoid exacerbating existing biases.
"Synthetic images can help with augmentation, but if there's an imbalance in how they're used, it will likely lead to biases and performance issues."
Understanding Dermatologists’ Use of Large Language Models
A survey conducted among dermatologists revealed that 65% of respondents have utilized large language models for clinical care. This indicates a growing acceptance and integration of AI tools in daily practice.
Notably, among those using these models, a significant majority engage with ChatGPT, with some employing it on a daily basis. Awareness of how AI tools are utilized can guide standards and expectations for integrating AI into healthcare practices.
"Sixty-five percent of dermatologists reported using large language models for clinical care."
Usage of Large Language Models by Physicians
A recent study found that 79% of physicians are utilizing large language models (LLMs) in their clinical practice, which was surprising given expectations that their use would be limited to administrative tasks or medical record management.
The study revealed that these physicians are actively employing LLMs for clinical decision-making processes, demonstrating a significant integration of generative AI in healthcare.
However, many physicians lack a full understanding of how LLMs work and the biases they may contain. This gap in knowledge raises concerns about the accuracy and reliability of AI-assisted clinical decisions.
Initial feedback on the accuracy of LLM outputs indicated that most physicians considered them only "somewhat accurate," and many reported needing to edit or correct AI-generated information.
"79% of physicians said they are using large language models for clinical care."
Concerns About Bias in AI Responses
There are ongoing concerns regarding the potential for LLMs to perpetuate biases present in training data, as shown in studies that examined how these models respond to questions related to race and medicine.
One notable example involved models reinforcing false beliefs about racial differences in medicine, such as the debunked race-based adjustment once applied in kidney function calculations. Reproducing such content reinforces harmful stereotypes and inaccuracies within clinical practice.
The implications of these biases are considerable, as they can contribute to the perpetuation of racial disparities in healthcare outcomes, thereby affecting patient safety and equity.
"Large language models perpetuate false race-based medicine."
Red Teaming to Identify Model Vulnerabilities
Researchers organized a "red teaming" event aimed at uncovering vulnerabilities in LLMs applicable to healthcare. This interdisciplinary session brought together computer scientists, biomedical data scientists, and physicians to assess LLM responses critically.
Participants explored various aspects of model safety, including potential bias, privacy concerns, and factual inaccuracies, labeling responses based on their implications for patient care.
After thorough evaluation, they found that 20% of the responses generated by these models were deemed inappropriate or inadequate for clinical use.
"After evaluating the responses, we found that 20% of them were inappropriate."
Examples of Inaccuracy in AI Responses
Specific examples of inaccuracies were highlighted, showcasing the potential for patient harm. Inaccurate responses about medical conditions, drug reactions, and scoring systems created risks for healthcare providers who might rely on this flawed information.
One case involved an incorrect calculation for DRESS syndrome scoring, showing how AI could lead to dangerous outcomes, especially when providers cannot independently verify the information the model provides.
"In an accurate response, the AI gave the wrong point value for the eosinophilia account."
The WebMD Effect and Self-Diagnosis
Many individuals turn to online platforms, similar to how they might use WebMD, to self-diagnose health issues by uploading images or symptoms for immediate feedback. This trend raises concerns about potential risks and effectiveness.
Self-advocacy in healthcare is a fundamental right and can lead to positive outcomes, as in cases where patients discover rare conditions through their own research. One patient used ChatGPT to propose a diagnostic avenue that multiple physicians had overlooked.
Despite the potential benefits of self-research, there is a risk of encountering misinformation and experiencing confirmation bias, where individuals seek out information that confirms their beliefs.
"Patients are going to look up their symptoms... and I think that's a fair thing for people to do."
The Dual Nature of Generative AI in Healthcare
The discussion acknowledges the mixed implications of generative AI in medical contexts: there are beneficial cases, but dangers of inaccurate information and automation bias exist as well.
There is concern about over-reliance on technology in hospital settings, akin to incidents where drivers blindly following GPS directions ended up in dangerous situations. Trusting systems without critical evaluation can have significant consequences.
A particular concern arises with large language models being integrated into electronic health record systems, where hallucinations can lead to significant errors that adversely influence patient care.
"There’s always that potential for harm... but there are also examples of how it's been useful."
Issues with Automation and AI Adoption in Healthcare
The rapid adoption of AI technologies in healthcare raises concerns about whether monitoring and approval processes are as rigorous as in other safety-critical fields like aviation. There is skepticism about rolling large language models into practice without thorough vetting.
Poorly designed electronic health record systems compound the problem: a polished AI-generated summary of patient information can be misleading if the underlying model has fabricated details.
The testing and regulatory frameworks for large language models appear lacking compared with those for computer vision systems, which have faced stricter scrutiny before implementation.
"I worry a lot on the large language model side... the bar to get approval in computer vision has been so much higher."
Algorithmic Bias in Healthcare
The discussion addresses the inevitability of some level of error in healthcare algorithms and the necessity of human oversight to mitigate potential harm. It highlights examples of algorithmic harm that have already occurred in the healthcare system, illustrating the consequences of biased algorithms.
A significant case involved an algorithm that allocated extra-care resources to white patients over Black patients who were in fact sicker. The discrepancy stemmed from the model using healthcare spending as a proxy for health: because less money had been spent on Black patients' care, the algorithm treated them as healthier, producing systemic inequities.
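The failure mode is easy to reproduce in a toy simulation: rank patients by spending as a proxy for illness, and any group that spends less at the same level of sickness is systematically deprioritized. The numbers below are illustrative, not from the study.

```python
# Toy reproduction of cost-as-label bias: two equally sick groups, one
# of which faces access barriers and therefore spends less on care.
import random

random.seed(0)
patients = []
for group in ("A", "B"):
    for _ in range(1000):
        sickness = random.uniform(0, 10)
        access = 1.0 if group == "A" else 0.6  # group B faces barriers
        cost = sickness * access * 1000        # spending: the flawed label
        patients.append((group, sickness, cost))

# Ranking by predicted cost is, in this toy setup, ranking by cost itself.
top = sorted(patients, key=lambda p: p[2], reverse=True)[:200]
share_b = sum(1 for p in top if p[0] == "B") / len(top)
print(f"Group B share of high-priority slots: {share_b:.0%}")
# Despite identical sickness distributions, group B is all but shut out.
```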
Another cited example involved UnitedHealthcare, where an algorithm made decisions about rehabilitation services without human intervention, resulting in a patient's untimely death when their services were inexplicably cut off.
"Algorithmic harm is not some nebulous thing; it's something that's already happened in our healthcare system."
Educational Needs for Future Healthcare Professionals
The need for education surrounding the limitations and appropriate applications of algorithms in healthcare is emphasized. It is crucial for newcomers to fields like dermatology or data science to understand where algorithms can be beneficial and where they may fall short.
Various professional societies, including the Society for Imaging Informatics in Medicine and the American Academy of Dermatology, are actively working to train their workforces on these issues.
While resources are available online, the rapid pace of technological advancement makes it challenging to keep curricula up-to-date. Engaging in ongoing education and collaboration within fields is essential for the next generation of healthcare professionals.
"We definitely need to educate our workforce."