To examine an approach for selecting small sets of diagnosis codes with high prediction performance in large datasets of electronic medical records.
Modelling study using national hospital and mortality records for patients with myocardial infarction (n=200 119), hip fracture (n=169 646), or colorectal cancer surgery (n=56 515) in England in 2015-17. One-year mortality was predicted from ICD-10 codes recorded for at least 0.5% of patients using logistic regression (‘full’ models). An approximation method was used to select fewer codes that explained at least 95% of variation in full model predictions (‘reduced’ models).
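The selection step described above can be illustrated with a minimal sketch. It is not the study's implementation, and it rests on several assumptions: the full model's prediction is taken to be its linear predictor (log odds), which is simulated here rather than fitted; codes are added by greedy forward selection using ordinary least squares; and all names (`solve`, `r_squared`, `select_codes`) and the synthetic data are hypothetical.

```python
import random

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

def r_squared(X, cols, target):
    """R^2 of a least squares fit of target on an intercept plus the chosen code columns."""
    rows = [[1.0] + [row[c] for c in cols] for row in X]
    p = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * t for r, t in zip(rows, target)) for i in range(p)]
    beta = solve(xtx, xty)
    fitted = [sum(b * v for b, v in zip(beta, r)) for r in rows]
    mean = sum(target) / len(target)
    ss_res = sum((t - f) ** 2 for t, f in zip(target, fitted))
    ss_tot = sum((t - mean) ** 2 for t in target)
    return 1.0 - ss_res / ss_tot

def select_codes(X, full_pred, threshold=0.95):
    """Add the code that raises R^2 the most until the threshold is reached (assumed search rule)."""
    selected, remaining, r2 = [], list(range(len(X[0]))), 0.0
    while remaining and r2 < threshold:
        best = max(remaining, key=lambda c: r_squared(X, selected + [c], full_pred))
        selected.append(best)
        remaining.remove(best)
        r2 = r_squared(X, selected, full_pred)
    return selected, r2

# Synthetic example: 300 patients, 6 binary code indicators (illustrative data only).
random.seed(0)
X = [[1.0 if random.random() < 0.3 else 0.0 for _ in range(6)] for _ in range(300)]
# Assumed full-model coefficients: two codes dominate the linear predictor.
coef = [2.0, 1.5, 0.05, 0.0, 0.02, 0.0]
full_pred = [-2.0 + sum(b * v for b, v in zip(coef, row)) for row in X]

selected, r2 = select_codes(X, full_pred)
print(selected, round(r2, 3))
```

In this toy setting the reduced set stops growing as soon as the retained codes reproduce at least 95% of the variation in the simulated full-model predictions. With real data, the target would come from the fitted full logistic regression.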
One-year mortality was 17.2% (34 520) after myocardial infarction, 27.2% (46 115) after hip fracture, and 9.3% (5273) after colorectal cancer surgery. Full models included 202, 257, and 209 ICD-10 codes in these populations, respectively. C-statistics for these models were 0.884 (95% CI 0.882, 0.886), 0.798 (0.795, 0.800), and 0.810 (0.804, 0.817). Reduced models included 18, 33, and 41 codes and had c-statistics of 0.874 (95% CI 0.872, 0.876), 0.791 (0.788, 0.793), and 0.807 (0.801, 0.813). Performance was also similar when measured using Brier scores. All models were well calibrated.
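For readers less familiar with the two performance measures, the following minimal sketch shows how a c-statistic and a Brier score can be computed from observed outcomes and predicted risks. The function names and the six-patient example are hypothetical and are not taken from the study. The pairwise loop is for illustration only; a rank-based calculation would be preferable for cohorts of this size.

```python
def c_statistic(y, p):
    """Share of (died, survived) pairs in which the patient who died had the
    higher predicted risk; tied predictions count as half."""
    died = [pi for yi, pi in zip(y, p) if yi == 1]
    survived = [pi for yi, pi in zip(y, p) if yi == 0]
    concordant = sum(
        1.0 if a > b else 0.5 if a == b else 0.0
        for a in died
        for b in survived
    )
    return concordant / (len(died) * len(survived))

def brier_score(y, p):
    """Mean squared difference between predicted risk and observed outcome."""
    return sum((pi - yi) ** 2 for yi, pi in zip(y, p)) / len(y)

# Hypothetical outcomes (1 = died within one year) and predicted risks.
y = [1, 0, 1, 0, 0, 1]
p = [0.8, 0.3, 0.6, 0.6, 0.1, 0.9]

c_stat = c_statistic(y, p)
brier = brier_score(y, p)
print(round(c_stat, 3), round(brier, 3))
```

A c-statistic closer to 1 indicates better discrimination, while a lower Brier score indicates predictions closer to the observed outcomes.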
Our approach selected small sets of diagnosis codes that predicted patient outcomes comparably to large, comprehensive sets of codes.

Copyright © 2020 Elsevier Inc. All rights reserved.
