One of the fascinating aspects of the AI field is the steady stream of groundbreaking research that keeps expanding what is technologically possible. A recent study released by Google researchers has caused quite a stir in the AI community and beyond. It describes a custom language model that outscored its AI predecessors and comfortably cleared the passing threshold that human candidates must meet on the US Medical Licensing Exam (USMLE).
Groundbreaking Results: The AI vs. Doctor Showdown
The bespoke large language model (LLM), tuned to exhibit medical domain knowledge, achieved an impressive score of 86.5% on a battery of thousands of questions modeled after the USMLE. To put this into perspective, a human passing score on the USMLE is approximately 60%, a mark that the previous model had also surpassed.
What makes these results even more remarkable is that the AI did more than score well on the exam: a panel of doctors also rated its responses as superior to human answers across an array of questions.
Under the Hood: The Methodology
The model, Med-PaLM 2, is a specially tuned version of Google's newly announced PaLM 2. It was evaluated on the MultiMedQA evaluation set, which consists of thousands of questions. For long-form responses, a panel of human doctors compared the model's answers against physician-written answers in a pairwise evaluation study. The researchers also probed for vulnerabilities by using an adversarial dataset designed to provoke harmful responses.
Med-PaLM 2 scored an all-time high of 86.5% on the MedQA benchmark questions, a significant leap from previous AI models, including GPT-3.5. The LLM's long-form responses also showed marked improvements.
Another striking finding was that a panel of 15 human doctors preferred the AI's answers over physician-written answers across 1,066 standardized questions. The AI's answers were rated higher on aspects such as alignment with medical consensus, comprehension, knowledge recall, reasoning, and lower likelihood of harm.
However, the AI model still showed weaknesses: it sometimes generated inaccurate or irrelevant information, a sign that hallucination remains a persistent problem.
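For readers curious how a pairwise evaluation like the one described above turns into per-aspect preference figures, here is a minimal sketch in Python. It assumes each rater judgment is recorded as an (aspect, preferred answer) pair. The `ratings` list and the `preference_rates` helper are hypothetical names invented for this illustration, and the sample values are made up; they are not data or code from the study.

```python
from collections import Counter

# Hypothetical judgments: one (aspect, preferred answer) pair per rater decision.
# These values are invented for illustration; they are NOT results from the study.
ratings = [
    ("reasoning", "model"),
    ("reasoning", "physician"),
    ("reasoning", "model"),
    ("reasoning", "tie"),
    ("knowledge recall", "model"),
    ("knowledge recall", "model"),
    ("knowledge recall", "physician"),
    ("knowledge recall", "model"),
]


def preference_rates(ratings):
    """Return, for each aspect, the fraction of judgments favoring each option."""
    counts = {}
    for aspect, choice in ratings:
        counts.setdefault(aspect, Counter())[choice] += 1

    rates = {}
    for aspect, counter in counts.items():
        total = sum(counter.values())
        # A Counter returns 0 for options that never appeared, so every
        # aspect reports all three options.
        rates[aspect] = {
            choice: counter[choice] / total
            for choice in ("model", "physician", "tie")
        }
    return rates


rates = preference_rates(ratings)
for aspect, shares in rates.items():
    print(f"{aspect}: model preferred in {shares['model']:.0%} of judgments")
```

A real study would also need to account for rater agreement and statistical uncertainty, which this sketch leaves out.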
What Does This Mean for the Future?
While these results represent a significant milestone, they do not mean doctors are at risk of being replaced by AI. As the researchers point out, real-life scenarios are far more complex and nuanced, and often require follow-up questioning that this study did not assess.
Nonetheless, these findings underline the potential role of AI in medicine. The AI "gold rush" is on, with companies investing heavily in AI development to augment or replace white-collar roles. As an example, venture capital firm Andreessen Horowitz recently invested $50M in Hippocratic AI, a company developing an AI to facilitate communication with patients.
Domain-specific LLMs like Med-PaLM 2 are likely to become more commonplace. As this study demonstrates, there is enormous potential in fine-tuning LLMs to serve as domain experts rather than relying on generic models.
While we are still in the early stages, the integration of AI into medicine is not a question of if, but when. These developments hint at a future where many of our medical interactions could be conducted with AI chatbots, saving the limited and precious resource that is the time and expertise of human doctors for the most complex and nuanced cases. One thing is certain: the future of medicine is exciting, and AI will play a significant role in shaping it.