
Google launches LLM evaluation tool for health data
Google has developed a new evaluation framework to help health systems assess large language models more efficiently and reliably.
The framework, called Adaptive Precise Boolean rubrics, converts complex evaluation tasks into yes-or-no questions tailored to each query. It aims to reduce dependence on expert reviews while improving scoring consistency, according to an Aug. 26 news release from the company. The approach was tested in metabolic health use cases, including diabetes, cardiovascular disease and obesity.
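The release does not publish Google's implementation, but the core idea of a boolean rubric can be sketched in a few lines: each response is judged against query-specific yes/no questions, and the score is simply the fraction answered "yes". All names and questions below are illustrative assumptions, not Google's actual API.

```python
# Minimal sketch of boolean-rubric scoring. Each rubric is a list of
# yes/no questions tailored to the query; a judge (human or automated)
# supplies one boolean answer per question.

def score_response(answers: list[bool]) -> float:
    """Fraction of rubric questions answered 'yes' (1.0 = fully satisfactory)."""
    return sum(answers) / len(answers)

# Hypothetical rubric for a diabetes-management response:
rubric = [
    "Does the response reference the patient's glucose readings?",
    "Does it avoid recommending medication changes?",
    "Is the advice consistent with standard diabetes guidance?",
]
answers = [True, True, False]  # one judge's yes/no answers
print(round(score_response(answers), 2))
```

Because each question is binary rather than a 1-to-5 judgment, two reviewers answering the same rubric have far less room to disagree, which is the mechanism behind the consistency gains described below.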
Compared to traditional Likert scales, the tool improved inter-rater reliability and cut evaluation time by more than 50%. It also showed stronger sensitivity to subtle changes in response quality, including scenarios in which patient data was deliberately omitted from LLM prompts, the release said.
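Inter-rater reliability is typically quantified with a chance-corrected agreement statistic such as Cohen's kappa. The release does not specify which statistic Google used; the sketch below shows the standard binary-label kappa as one plausible way such a comparison could be measured.

```python
# Cohen's kappa for two raters giving binary (yes/no) answers.
# kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement
# and p_e is the agreement expected by chance.

def cohen_kappa(r1: list[int], r2: list[int]) -> float:
    n = len(r1)
    p_o = sum(a == b for a, b in zip(r1, r2)) / n
    p1_yes = sum(r1) / n
    p2_yes = sum(r2) / n
    # Chance agreement: both say yes, or both say no.
    p_e = p1_yes * p2_yes + (1 - p1_yes) * (1 - p2_yes)
    return 1.0 if p_e == 1 else (p_o - p_e) / (1 - p_e)

# Toy data: two raters answering the same four rubric questions.
rater_a = [1, 1, 1, 0]
rater_b = [1, 0, 1, 0]
print(round(cohen_kappa(rater_a, rater_b), 2))
```

A kappa of 1.0 means perfect agreement beyond chance; values near 0 mean the raters agree no more often than random guessing would predict.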
In a study using de-identified data from 141 participants with metabolic conditions, the rubric method reliably flagged quality drops in responses missing key personal data. Evaluations using Likert scales did not consistently detect those differences, the company said.
An automated version of the rubric classifier achieved an accuracy of 0.77 and an F1 score of 0.83, allowing health systems to scale evaluations without full human review. The auto-adapted rubrics performed comparably to human-curated versions.
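For readers unfamiliar with the two metrics: accuracy is the fraction of rubric questions the automated classifier answers the same way a human would, while F1 balances precision and recall on the "yes" answers. The toy computation below (invented data, not the study's) shows how both are derived from paired human/automated judgments.

```python
# Accuracy and F1 for an automated yes/no classifier scored against
# human rubric answers. Data here is illustrative only.

def accuracy(y_true: list[int], y_pred: list[int]) -> float:
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def f1(y_true: list[int], y_pred: list[int]) -> float:
    tp = sum(t and p for t, p in zip(y_true, y_pred))          # true positives
    fp = sum((not t) and p for t, p in zip(y_true, y_pred))    # false positives
    fn = sum(t and (not p) for t, p in zip(y_true, y_pred))    # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    return 2 * precision * recall / denom if denom else 0.0

human = [1, 1, 1, 0]      # human rubric answers
auto  = [1, 1, 0, 0]      # automated classifier's answers
print(accuracy(human, auto), f1(human, auto))
```

An F1 of 0.83 with accuracy of 0.77 suggests the classifier agrees with humans often enough to triage evaluations at scale, while borderline cases could still be routed to expert review.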
Google said the tool is not tied to any product and is intended for controlled research. The company said the framework could support scalable, safety-focused evaluations across clinical decision support, patient education and other LLM use cases in healthcare.
The post Google launches LLM evaluation tool for health data appeared first on Becker’s Hospital Review | Healthcare News & Analysis.