Sample-and-hold/mean Imputation and XGBoost for Sepsis Prediction

Lei Zuo and Hewei Yang
University of Science and Technology Beijing


Aims: Sepsis is a major public health issue responsible for significant morbidity, mortality, and healthcare expense. Early detection is critical for improving sepsis outcomes. This study aimed to use machine learning to predict sepsis at least 6 hours before its clinical prediction and to identify the important predictive factors. Methods: Early detection of sepsis was treated as a binary classification problem: sepsis versus non-sepsis patients. The data comprised 40,336 ICU patients (1,552,210 time points) with 40 time-dependent features, including demographics, vital signs, and laboratory values. We removed features whose missing rate exceeded 50% across all data and, for features with a missing rate below 50%, filled the missing values using an interpolation algorithm based on the previous-moment value, the next-moment value, and the mean value. We then trained an Extreme Gradient Boosting (XGBoost) model on the selected features to predict sepsis and to identify the important factors associated with sepsis patients. To evaluate model performance, we calculated accuracy and the utility score. Results: We selected 12 features to train the XGBoost model, which can predict sepsis between 12 hours before and 3 hours after the clinical prediction time. Training accuracy reached up to 99%, and the cross-validated utility score obtained in phase 1 was 0.40 on the available public data. Ranked by the number of times a feature is used to split the data across all trees, the top 3 features are “age”, “HospAdmTime” (hours between hospital admission and ICU admission), and “ICULOS” (ICU length of stay, in hours since ICU admission), respectively. Conclusion: A model based on machine learning can predict sepsis early, which can support clinicians in decisions about sepsis.
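
The feature selection and imputation step described in the Methods can be illustrated with a minimal sketch. The abstract does not specify how the previous-moment value, the next-moment value, and the mean are combined, so the order used here (carry the previous value forward, then use the next value for leading gaps, then fall back to the feature mean) is an assumption. The function names `missing_rate`, `impute_series`, and `select_and_impute`, the data layout, and the toy values are hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Optional

# One patient's recordings of one feature; None marks a missing time point.
Series = List[Optional[float]]


def missing_rate(columns: List[Series]) -> float:
    """Fraction of missing entries for one feature, pooled over all patients."""
    total = sum(len(c) for c in columns)
    missing = sum(v is None for c in columns for v in c)
    return missing / total if total else 1.0


def impute_series(values: Series, fallback_mean: float) -> List[float]:
    """Fill gaps with the previous value, then the next value, then the mean.

    Assumed order of operations; the source only names the three ingredients.
    """
    filled: Series = list(values)
    # Sample-and-hold: carry the last observed value forward.
    last = None
    for i, v in enumerate(filled):
        if v is None:
            filled[i] = last
        else:
            last = v
    # Leading gaps have no previous value, so use the next observed one.
    nxt = None
    for i in range(len(filled) - 1, -1, -1):
        if filled[i] is None:
            filled[i] = nxt
        else:
            nxt = filled[i]
    # A series with no observations at all falls back to the feature mean.
    return [fallback_mean if v is None else v for v in filled]


def select_and_impute(
    data: Dict[str, List[Series]], max_missing: float = 0.5
) -> Dict[str, List[List[float]]]:
    """Drop features missing in more than `max_missing` of entries; impute the rest."""
    out: Dict[str, List[List[float]]] = {}
    for name, columns in data.items():
        if missing_rate(columns) > max_missing:
            continue  # feature removed, as for missing rates above 50%
        observed = [v for c in columns for v in c if v is not None]
        mean = sum(observed) / len(observed)
        out[name] = [impute_series(c, mean) for c in columns]
    return out


# Toy data (made up): two features, three patients each.
toy = {
    "HR": [[None, 80.0, None, 90.0], [70.0, 72.0, 75.0], [None]],
    "Lactate": [[None, None, None, 2.0], [None, None, None], [None]],
}
result = select_and_impute(toy)
print(result)
```

In this toy input, "Lactate" is missing in 7 of 8 entries and is dropped, while "HR" is kept and filled per patient. The imputed features would then be passed to the classifier; that step is not shown here.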