
Large Language Models Cut Time, Cost of Guideline Development


FROM GASTROENTEROLOGY

Large language models (LLMs) may help streamline clinical guideline development by dramatically reducing the time and cost required for systematic reviews, according to a pilot study from the American Gastroenterological Association (AGA).

Faster, cheaper study screening could allow societies to update clinical recommendations more frequently, improving alignment with the latest evidence, lead author Sunny Chung, MD, of Yale School of Medicine, New Haven, Connecticut, and colleagues reported.


“Each guideline typically requires 5 to 15 systematic reviews, making the process time-consuming (averaging more than 60 weeks) and costly (more than $140,000),” the investigators wrote in Gastroenterology. “One of the most critical yet time-consuming steps in systematic reviews is title and abstract screening. LLMs have the potential to make this step more efficient.”

To test this approach, the investigators developed, validated, and applied a dual-model LLM screening pipeline with human-in-the-loop oversight, focusing on randomized controlled trials in AGA guidelines. 

The system was built using the 2021 guideline on moderate-to-severe Crohn’s disease, targeting biologic therapies for induction and maintenance of remission. 

Using chain-of-thought prompting and structured inclusion criteria based on the PICO framework, the investigators deployed GPT-4o (OpenAI) and Gemini-1.5-Pro (Google DeepMind) as independent screeners, each assessing titles and abstracts according to standardized logic encoded in JavaScript Object Notation (JSON). This approach mimicked a traditional double-reviewer system.
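The study does not publish its screening code, but a minimal sketch of how such a dual-model, JSON-structured screening step might look is shown below. It assumes the openai and google-generativeai Python SDKs with API keys in the environment; the criteria, prompt wording, and the rule for combining the two verdicts are illustrative placeholders, not the authors' actual pipeline.

"""Minimal sketch of a dual-LLM title/abstract screener (illustrative only).

Assumes the openai and google-generativeai Python SDKs and API keys in the
environment; criteria, prompt wording, and the combination rule are
hypothetical placeholders rather than the published pipeline.
"""
import json
import os

from openai import OpenAI
import google.generativeai as genai

# Hypothetical PICO-style inclusion criteria, expressed as JSON so both
# models receive identical, machine-readable screening logic.
CRITERIA = {
    "population": "adults with moderate-to-severe Crohn's disease",
    "intervention": "biologic therapy for induction or maintenance of remission",
    "design": "randomized controlled trial",
}

def build_prompt(record: str) -> str:
    # Chain-of-thought style instruction plus the JSON-encoded criteria.
    return (
        "You are screening titles and abstracts for a systematic review.\n"
        "Think step by step about each inclusion criterion, then decide.\n"
        "Inclusion criteria (JSON): " + json.dumps(CRITERIA) + "\n"
        'Answer with only a JSON object: {"include": true or false, "reason": "..."}\n\n'
        "Title and abstract:\n" + record
    )

openai_client = OpenAI()  # reads OPENAI_API_KEY from the environment
genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
gemini = genai.GenerativeModel("gemini-1.5-pro")

def screen_with_gpt4o(record: str) -> bool:
    resp = openai_client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": build_prompt(record)}],
        response_format={"type": "json_object"},
        temperature=0,
    )
    return json.loads(resp.choices[0].message.content)["include"]

def screen_with_gemini(record: str) -> bool:
    resp = gemini.generate_content(build_prompt(record))
    # The model may wrap its JSON verdict in a code fence; strip it before parsing.
    text = resp.text.strip().removeprefix("```json").removesuffix("```").strip()
    return json.loads(text)["include"]

def dual_screen(record: str) -> str:
    # Mimic a double-reviewer system: a single "include" vote keeps the record,
    # favoring sensitivity and leaving final adjudication to human reviewers.
    votes = (screen_with_gpt4o(record), screen_with_gemini(record))
    return "include" if any(votes) else "exclude"

if __name__ == "__main__":
    abstract = "A randomized trial of ustekinumab for induction of remission in Crohn's disease ..."
    print(dual_screen(abstract))

The permissive combination rule here (either model voting to include keeps the abstract) is one plausible way to favor sensitivity while leaving final adjudication to human reviewers, consistent with the human-in-the-loop oversight the study describes.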

After initial testing, the pipeline was validated in a 2025 update of the same guideline, this time spanning 6 focused clinical questions on advanced therapies and immunomodulators. Results were compared against manual screening by 2 experienced human reviewers, with total screening time documented. 

The system was then tested across 4 additional guideline topics: fecal microbiota transplantation (FMT) for irritable bowel syndrome and Clostridioides difficile, gastroparesis, and hepatocellular carcinoma. A final test applied the system to a forthcoming guideline on complications of acute pancreatitis.

Across all topics, the dual-LLM system achieved 100% sensitivity in identifying randomized controlled trials (RCTs). For the 2025 update of the AGA guideline on Crohn’s disease, the models flagged 418 of 4,377 abstracts for inclusion, capturing all 25 relevant RCTs in just 48 minutes. Manual screening of the same dataset previously took almost 13 hours.

Comparable accuracy and time savings were observed for the other topics. 

The pipeline correctly flagged all 13 RCTs in 4,820 studies on FMT for irritable bowel syndrome, and all 16 RCTs in 5,587 studies on FMT for Clostridioides difficile, requiring 27 and 66 minutes, respectively. Similarly, the system captured all 11 RCTs in 3,919 hepatocellular carcinoma abstracts and all 18 RCTs in 1,578 studies on gastroparesis, completing each task in under 65 minutes. Early testing on the upcoming guideline for pancreatitis yielded similar results.

Cost analysis underscored the efficiency of this approach. At an estimated $175–200 per hour for expert screeners, traditional abstract screening would cost around $2,500 per review, versus approximately $100 for the LLM approach—a 96% reduction.
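As a rough sanity check, those figures are internally consistent (a back-of-the-envelope sketch using the article's approximate numbers, not the authors' cost model):

# Manual screening: roughly 13 hours at a mid-range $190/hour (article's figures).
manual_cost = 13 * 190                       # ~$2,470, in line with the ~$2,500 cited
llm_cost = 100                               # approximate per-review cost of the LLM approach
print(f"{1 - llm_cost / manual_cost:.0%}")   # prints "96%"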

The investigators cautioned that human oversight remains necessary to verify the relevance of studies flagged by the models. While the system’s sensitivity was consistent, it also selected articles that were ultimately excluded by expert reviewers. Broader validation will be required to assess performance across non-RCT study designs, such as observational or case-control studies, they added.

“As medical literature continues to expand, the integration of artificial intelligence into evidence synthesis processes will become increasingly vital,” Dr. Chung and colleagues wrote. “With further refinement and broader validation, this LLM-based pipeline has the potential to revolutionize evidence synthesis and set a new standard for guideline development.”

This study was funded by the National Institutes of Health, National Institute of Diabetes and Digestive and Kidney Diseases. The investigators reported no conflicts of interest.