Using Language Models for Surgical Consensus: Key Findings

Large language models may enhance consensus-building on ventral rectopexy but vary significantly in reliability.

  • OpenEvidence had the highest content appropriateness score (3.5/5), outperforming Gemini (3.0/5) and ChatGPT (2.8/5; p < 0.001).
  • ChatGPT fabricated 53% of its citations, compared with 12% for Gemini and 0% for OpenEvidence (p < 0.001).

This suggests a need for careful selection of tools in clinical discussions.

  • Only 34% of OpenEvidence citations fell below level I-III evidence, suggesting it is the more reliable of the three tools for guideline work.

Journal article by Lee FG, Larson EL (…) Perry WRG, and 12 other authors, in Dis Colon Rectum

Copyright © The ASCRS 2026.
