Using Language Models for Surgical Consensus: Key Findings

Large language models may enhance consensus-building on ventral rectopexy but vary significantly in reliability.

OpenEvidence had the highest content appropriateness score (3.5/5), outperforming Gemini (3.0/5) and ChatGPT (2.8/5; p < 0.001).
ChatGPT fabricated 53% of its citations, compared with 12% for Gemini and 0% for OpenEvidence (p < 0.001).

These differences in content quality and citation accuracy suggest that tools should be selected carefully before they are used in clinical discussions.

Only 34% of OpenEvidence citations ranked below level I-III evidence, which supports its use as a source when developing guidelines.