Feature Importance for Sepsis Prediction Using Gradient Boosting Decision Tree

Kang Yanni and Jia Xiaoyu
PingAn Technology Company


Sepsis is a life-threatening disease and a major cause of mortality in the ICU, so early detection and antibiotic treatment are critical for improving sepsis outcomes. We used XGBoost, an optimized distributed gradient boosting library, to build a Gradient Boosting Decision Tree (GBDT) model that predicts sepsis. In this challenge, ICULOS time-series samples are generated for each subject, one per time step. For every time step, we use the 40 provided features together with statistical features computed from the beginning of the measurement up to that time step. The statistical features are the minimum, maximum, mean, standard deviation, skewness, and number of measurements of each provided feature, which gives 40 + 40 × 6 = 280 features for each time-series sample. The model is trained with the XGBoost algorithm. We split the original data set by subject into a training set (80% of subjects), a validation set (10% of subjects), and a testing set (10% of subjects). To evaluate the results more reliably, we use K-fold cross-validation to train the parameters, and to fine-tune the model we use grid search to find the best hyperparameters. The five most important features are HospAdmTime, ICULOS, EtCO2, Age, and Temp. Our results show that HospAdmTime (hours between hospital admission and ICU admission) and ICULOS (ICU length of stay) are the two most important features for sepsis detection. Age is another sensitive feature: older patients are more likely to develop sepsis. We also find that vital signs are more significant than laboratory values. In conclusion, the model reaches an AUC of 0.84 and a utility value, which is the scoring metric, of 0.5.
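The running-statistics step described above can be sketched as follows. This is a minimal illustration using pandas, not the authors' implementation: the function name add_running_stats, the placeholder column names f0 to f39, and the random demo data are all assumptions made for the example. It only shows how 40 input columns plus six running statistics per column yield 280 columns.

```python
import numpy as np
import pandas as pd

# Statistics computed over an expanding window (from the first row up to the current row).
WINDOW_STATS = ("min", "max", "mean", "std", "skew")


def add_running_stats(patient: pd.DataFrame) -> pd.DataFrame:
    """Append running statistics for every column of one subject's record.

    Rows are assumed to be ordered by time step. Missing values (NaN) are
    skipped by the window statistics, so a statistic is NaN until enough
    measurements have been observed.
    """
    expanding = patient.expanding(min_periods=1)
    parts = [patient]
    for name in WINDOW_STATS:
        stat = getattr(expanding, name)()
        parts.append(stat.add_suffix(f"_{name}"))
    # Number of non-missing measurements seen so far for each column.
    parts.append(patient.notna().cumsum().add_suffix("_count"))
    return pd.concat(parts, axis=1)


# Hypothetical demo record: 5 time steps, 40 placeholder feature columns.
rng = np.random.default_rng(0)
demo = pd.DataFrame(
    rng.normal(size=(5, 40)),
    columns=[f"f{i}" for i in range(40)],
)
demo.iloc[1, 0] = np.nan  # simulate a missing measurement

features = add_running_stats(demo)
print(features.shape)
```

In a full pipeline this function would be applied to each subject separately, so that statistics never mix measurements from different subjects, and the resulting rows would then be passed to the gradient boosting model.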