Using Missing Indicators and Difference Features to Predict Sepsis with XGBoost

Humza Haider
University of Alberta


Introduction: Sepsis is a serious condition that can arise as the body responds to an infection and can lead to tissue damage, organ failure, and death. Predicting sepsis before onset has the potential to greatly reduce the mortality rate and length of hospital stay for sepsis patients. This study aims to develop a system that predicts the onset of sepsis in advance, ideally six to twelve hours before it occurs.

Methods: The data consisted of 40,336 patient files containing hourly vital signs, lab tests, and demographics. For reasons such as clinicians not conducting lab tests every hour, the data set contains a very large proportion of missing values. In addition to the usual forward-fill and mean imputation methods, Missing Indicators (MIs) were constructed as features to signify whether each value was imputed or truly observed. Another aspect of this data set is the time dimension; it is imperative that the patient's previous medical information be taken into account. To capture this, difference features were employed, which measure the difference between a feature's value at the current time and its values at previous time points. Once feature extraction was complete, gradient boosted trees (namely XGBoost) were trained with adjusted class weights to predict whether a patient would experience sepsis in the next twelve hours.
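
The feature construction described in the Methods can be illustrated with a minimal sketch. It is not the study's implementation: the function name `add_features`, the column names, the `_mi` and `_diff` suffixes, the lag choices, the toy values, and the decision to set undefined early differences to zero are all assumptions made for illustration. The last lines show one common way to derive a class weight (the ratio of negative to positive examples), which could be passed to XGBoost through its `scale_pos_weight` parameter; the abstract does not state how the weights were actually adjusted.

```python
import numpy as np
import pandas as pd


def add_features(patient, value_cols, train_means, lags=(1, 2)):
    """Add missing indicators, imputed values, and difference features.

    `patient` holds one row per hour for a single patient, in time order.
    `train_means` maps each column to a mean computed on training data.
    All names and defaults here are illustrative assumptions.
    """
    out = patient.copy()
    for col in value_cols:
        # Missing indicator: 1 where the raw value was absent, 0 where observed.
        # This must be computed before any imputation overwrites the gaps.
        out[f"{col}_mi"] = out[col].isna().astype(int)

        # Forward-fill carries the last observation forward in time; hours
        # before the first observation fall back to the training mean.
        out[col] = out[col].ffill().fillna(train_means[col])

        # Difference features: current value minus the value `lag` hours ago.
        # Early hours with no earlier row get 0 (an assumption of this sketch).
        for lag in lags:
            out[f"{col}_diff{lag}"] = (out[col] - out[col].shift(lag)).fillna(0.0)
    return out


# Toy example: four hourly rows for one hypothetical patient.
patient = pd.DataFrame(
    {
        "HR": [80.0, np.nan, 90.0, np.nan],
        "Lactate": [np.nan, np.nan, 2.0, np.nan],
    }
)
train_means = {"HR": 85.0, "Lactate": 1.5}  # made-up training means
features = add_features(patient, ["HR", "Lactate"], train_means, lags=(1,))

# Class weight as the negative-to-positive ratio of the training labels.
labels = np.array([0, 0, 0, 1])  # made-up labels
scale_pos_weight = (labels == 0).sum() / (labels == 1).sum()
```

Computing the indicator before imputation is the key ordering constraint: once a column has been forward-filled, imputed and observed values can no longer be told apart.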

Results: A normalized utility score of 0.374 was achieved on a random subset of unseen test data. Ongoing work will explore custom features derived from rule-based systems such as the Sequential Organ Failure Assessment (SOFA) score and the Systemic Inflammatory Response Syndrome (SIRS) criteria.