
Hugging Face releases a benchmark for testing generative AI on health tasks

Generative AI models are increasingly being introduced into healthcare settings – in some cases prematurely, perhaps. Early adopters believe the technology will unlock greater efficiency while surfacing insights that would otherwise be missed. Critics, meanwhile, point out that these models have flaws and biases that could contribute to worse health outcomes.

But is there a quantitative way to know how helpful or harmful a model might be when given tasks like summarizing patient records or answering health-related questions?

Hugging Face, the AI startup, offers a solution in a recently released benchmark called Open Medical-LLM. Created in partnership with researchers from the non-profit Open Life Science AI and the University of Edinburgh's Natural Language Processing Group, Open Medical-LLM aims to standardize the evaluation of generative AI models across a range of medicine-related tasks.

Open Medical-LLM is not a from-scratch benchmark in itself, but rather an assembly of existing test sets – MedQA, PubMedQA, MedMCQA and so on – designed to probe models for general medical knowledge and related fields, such as anatomy, pharmacology, genetics and clinical practice. The benchmark contains multiple-choice and open-ended questions that require medical reasoning and understanding, drawing on material including U.S. and Indian medical licensing exams and college biology test question banks.
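For a sense of what that underlying data looks like, here is a minimal sketch of how one of the constituent test sets might be loaded with Hugging Face's datasets library. The dataset ID ("medmcqa"), split and record layout are assumptions for illustration and may not match the exact repositories the leaderboard evaluates against.

```python
# Minimal sketch: loading one of Open Medical-LLM's constituent test sets
# with the Hugging Face `datasets` library. The dataset ID and split are
# assumptions and may differ from what the leaderboard actually uses.
from datasets import load_dataset

# MedMCQA: multiple-choice questions drawn from Indian medical entrance exams.
medmcqa = load_dataset("medmcqa", split="validation")

# Inspect a single record to see the question-and-options structure.
print(medmcqa[0])
```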

“(Open Medical-LLM) allows researchers and practitioners to identify the strengths and weaknesses of different approaches, stimulate further progress in the field, and ultimately contribute to better care and outcomes for patients,” Hugging Face wrote in a blog post.

Image Credits: Hugging Face

Hugging Face positions the benchmark as a “robust assessment” of generative AI models related to healthcare. But some medical experts on social media have warned against placing too much emphasis on Open Medical-LLM, lest it lead to misinformed rollouts.

The gap between answering exam-style medical questions and actual clinical practice, these experts note, can be quite wide.

Clémentine Fourrier, a researcher at Hugging Face and co-author of the blog post, agreed.

“These rankings should only be used as a first approximation of which generative AI model to explore for a given use case, but a more in-depth testing phase is always necessary to examine the model's limitations and suitability under real-world conditions,” Fourrier replied on X.

This is reminiscent of Google’s experience when it attempted to bring an AI diabetic retinopathy screening tool to Thai health systems.

Google created a deep learning system that analyzes images of the eye for signs of retinopathy, a leading cause of vision loss. But despite high theoretical accuracy, the tool proved impractical in real-world testing, frustrating both patients and nurses with inconsistent results and a general lack of harmony with on-the-ground practices.

Tellingly, of the 139 AI-related medical devices approved to date by the U.S. Food and Drug Administration, none use generative AI. It is exceptionally difficult to test how the performance of a generative AI tool in the laboratory will translate to hospitals and outpatient clinics and, perhaps more importantly, how the results might change over time.

That is not to say Open Medical-LLM isn't useful or informative. The results leaderboard, at the very least, serves as a reminder of just how poorly models answer basic health questions. But neither Open Medical-LLM nor any other benchmark is a substitute for carefully considered real-world testing.
