| Item Type | Journal Article |
| Date of Release | 2025-12-26 |
| Title (en) | Reasoning-optimised large language models reach near-expert accuracy on board-style orthopaedic exams: A multi-model comparison on 702 multiple-choice questions |
| Language | eng |
| Keywords (en) | artificial intelligence; clinical decision support; large language models; medical education; orthopaedic surgery |
| Resource Type | journal article |
| Access Rights | open access |
| Authors | Diniz, Pedro; Yokoe, Takuji (ja: 横江, 琢示; ja-Kana: ヨコエ, タクジ; University of Miyazaki; WEKO: 34429; e-Rad Researcher: 50895894); Öttl, Felix C; Pereira, Hélder; Henriques, Rui; Samuelsson, Kristian |
| Abstract (en) |

The purpose of this study was to compare the accuracy, calibration, reproducibility and operating cost of seven large language models (LLMs), including four newer models capable of using advanced reasoning techniques to analyse complex medical information and generate accurate responses, on text-only orthopaedic multiple-choice questions (MCQs), and to quantify gains over GPT-4.

From Orthobullets, 702 unique, non-image MCQs (drawn from the AAOS Self-Assessment Examinations, Self-Assessment-Based Questions and Orthopaedic In-Training Examination-Based Questions banks) were extracted. Each question was submitted to the following LLMs: OpenAI o3, Anthropic Claude Sonnet 4, Claude Opus 4 (with/without 'Extended Thinking') and Google Gemini 2.5 Pro. Additionally, OpenAI's GPT-4, GPT-4o and the open-weight Gemma 3 27B served as comparators. The primary outcome was overall accuracy. The secondary outcomes were topic- and difficulty-stratified accuracy, calibration (expected calibration error [ECE] and Brier score), reproducibility (flip rate on a retest question subset), latency, token use and cost. Statistical tests included paired McNemar, Cochran Q, ordinal logistic regression and Fleiss κ (Bonferroni-adjusted α = 0.05).

GPT-4 achieved 69.7% accuracy (95% CI = 66.2-72.9). All four reasoning-optimised models scored ≥14 percentage points higher (p < 3.3 × 10⁻¹⁵); OpenAI o3 led with 93.6% (95% CI = 91.5-95.2), which represents a 34% relative error reduction. Accuracy tended to decline with question difficulty, yet the reasoning advantage persisted in every difficulty stratum. Claude Opus 4 showed the best calibration (ECE = 0.023), while GPT-4 exhibited overconfidence (ECE = 0.215). All models except Gemma 3 27B exhibited non-zero flip rates. Median query times ranged from 0.9 s (Gemma) to 15.9 s (Gemini 2.5 Pro), and costs from 0 to 29.9 USD per 1000 queries.

Reasoning-optimised LLMs now answer text-based orthopaedic exam questions with high accuracy and substantially better confidence calibration than earlier models. However, persistent stochasticity […]
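The abstract reports calibration as ECE and Brier score, reproducibility as a flip rate, and pairwise significance via McNemar's test. The Python sketch below shows one conventional way to compute these from per-question records; it is an illustration under stated assumptions (10 equal-width confidence bins for ECE, a continuity-corrected McNemar statistic, hypothetical variable names), not the authors' code.

```python
"""Minimal sketch (not the authors' code) of metrics named in the
abstract: expected calibration error (ECE), Brier score, flip rate,
and a paired McNemar test. The binning scheme, continuity correction
and all variable names are illustrative assumptions."""
import numpy as np
from scipy.stats import chi2


def expected_calibration_error(conf, correct, n_bins=10):
    """ECE: bin questions by stated confidence, then take the
    bin-weighted mean |accuracy - mean confidence| gap."""
    conf = np.asarray(conf, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for i in range(n_bins):
        in_bin = (conf >= edges[i]) & (conf < edges[i + 1])
        if i == n_bins - 1:                      # include conf == 1.0
            in_bin |= conf == 1.0
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - conf[in_bin].mean())
            ece += in_bin.mean() * gap
    return ece


def brier_score(conf, correct):
    """Brier score for binary outcomes: mean squared error between
    stated confidence and the 0/1 correctness indicator (lower is better)."""
    conf = np.asarray(conf, dtype=float)
    correct = np.asarray(correct, dtype=float)
    return float(np.mean((conf - correct) ** 2))


def flip_rate(answers_run1, answers_run2):
    """Reproducibility: fraction of retested questions whose chosen
    option changed between two runs of the same model."""
    a, b = np.asarray(answers_run1), np.asarray(answers_run2)
    return float(np.mean(a != b))


def mcnemar_test(correct_a, correct_b):
    """Paired McNemar chi-squared test (continuity-corrected), built on
    the discordant pairs: questions one model got right and the other wrong."""
    a = np.asarray(correct_a, dtype=bool)
    b = np.asarray(correct_b, dtype=bool)
    n01 = int(np.sum(~a & b))        # A wrong, B right
    n10 = int(np.sum(a & ~b))        # A right, B wrong
    if n01 + n10 == 0:
        return 0.0, 1.0              # no discordant pairs
    stat = (abs(n01 - n10) - 1.0) ** 2 / (n01 + n10)
    return stat, float(chi2.sf(stat, df=1))
```

For intuition: comparing GPT-4's 69.7% with o3's 93.6% on the same 702 questions concentrates nearly all discordant pairs on one side, which is how p-values as small as the reported p < 3.3 × 10⁻¹⁵ can arise from this test.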
| Journal (en) | Knee Surgery, Sports Traumatology, Arthroscopy: Official Journal of the ESSKA |
| Issue Date | 2025-12-17 |
| Publisher (en) | Wiley |
| EISSN | 1433-7347 |
| DOI (isVersionOf) | https://doi.org/10.1002/ksa.70222 |
| Rights (en) | © 2025 The Author(s). |
| Publication Type | VoR (Version of Record) |