The large language model GPT-4, accessed via the chatbot interface ChatGPT, demonstrates near or above human-level performance on surgical knowledge assessments drawn from two widely used question banks. On open-ended questions, ChatGPT answers 47.9% and 66.1% of items correctly, while on multiple-choice questions it achieves accuracy rates of 71.3% and 67.9%. However, inconsistent responses to repeated queries raise concerns about the safe and reliable application of large language models like ChatGPT in clinical settings.
Journal Article by Beaulieu-Jones BR, Berrigan MT (…) Brat GA et al. in BMC Surg
Copyright © 2023 Elsevier Inc. All rights reserved.
