Dealing with Imbalance and Missing Values in Electronic Health Record Data to Produce Accurate and Interpretable Sepsis Prediction

Tatiana Malygina1, Elena Ericheva2, Ivan Drokin2
1ITMO University, Intellogic LLC, 2Intellogic LLC


Abstract

During the unofficial phase of the PhysioNet/CinC 2019 competition, we developed a baseline solution using gradient boosting on decision trees, which computes the sepsis probability at moment t from the patient's data at moments t and t-1, together with their difference. We used CatBoost because it handles categorical variables out of the box (we treated demographic data as constant values, which rarely change during a patient's ICU stay). We filled missing values in a patient's sequence of vital signs using the previous/next values; when an entire column of measurements was missing, we used the mean value of that column. We used 11-fold cross-validation to produce a stable result and to make use of the whole dataset of 40k patients (we also balanced the data by undersampling on each train/test split, since sepsis is a rare event and the dataset is highly imbalanced). Our baseline solution received a normalized utility score of 0.52 during the unofficial phase.
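The preprocessing steps described above can be illustrated with a minimal sketch in plain Python. It is not the code used for the submission: the function names, the forward-then-backward fill order, the handling of the first hour, and the 1:1 undersampling ratio are all assumptions made for illustration.

```python
import random


def fill_missing(series, fallback_mean):
    """Fill None gaps from the previous value, then from the next value.

    If the patient has no measurement at all in this column, every entry
    is replaced by fallback_mean (e.g. the column mean over the dataset).
    """
    filled = list(series)
    last = None
    for i, value in enumerate(filled):          # forward fill
        if value is None:
            filled[i] = last
        else:
            last = value
    nxt = None
    for i in range(len(filled) - 1, -1, -1):    # backward fill for leading gaps
        if filled[i] is None:
            filled[i] = nxt
        else:
            nxt = filled[i]
    return [fallback_mean if v is None else v for v in filled]


def lag_features(values):
    """For each moment t return (x_t, x_{t-1}, x_t - x_{t-1}).

    Assumption: the first moment has no predecessor, so it reuses its own value.
    """
    rows = []
    for t, current in enumerate(values):
        previous = values[t - 1] if t > 0 else current
        rows.append((current, previous, current - previous))
    return rows


def undersample(rows, labels, seed=0):
    """Keep all positive rows and an equally sized random subset of negatives.

    Assumption: a 1:1 class ratio; the abstract does not state the ratio used.
    """
    rng = random.Random(seed)
    positives = [i for i, y in enumerate(labels) if y == 1]
    negatives = [i for i, y in enumerate(labels) if y == 0]
    kept = positives + rng.sample(negatives, min(len(negatives), len(positives)))
    kept.sort()
    return [rows[i] for i in kept], [labels[i] for i in kept]


heart_rate = fill_missing([None, 80, None, 90], fallback_mean=85.0)
features = lag_features(heart_rate)
balanced_rows, balanced_labels = undersample(features, [0, 0, 0, 1])
```

In this sketch the undersampling would be applied to the training part of each cross-validation split, and the balanced rows would then be passed to the gradient boosting classifier.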

We think that the following ideas might help to improve our baseline solution during the official phase:

  1. It might be better to fill some of the rare missing measurements using a linear regression model based on the patient's other (non-missing) measurements, since some of the measurements correlate with each other; another way to improve our model is to filter the measurements it uses based on the feature importances of our baseline solution.
  2. To predict sepsis early and stop it from progressing, one should treat the patient's vital signs and measurements, together with demographic information, as multivariate time series data. Sepsis is also quite a rare event. That is why, during the official phase, we plan to develop an LSTM-based method that takes into account the patient's information at different moments in time and defines sepsis as an anomaly in the patient's state, and to compare it with our baseline solution.
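As a rough illustration of the first idea above, the following sketch imputes a rarely measured value from one correlated measurement with ordinary least squares, and keeps only the most important features. It is a hypothetical example in plain Python: the single-predictor model, the function names, and the `top_k` cut-off are assumptions, not the method planned for the official phase.

```python
def regression_impute(target, predictor):
    """Fill None entries of target using a line fitted on rows where both
    target and predictor are observed. Entries are left as None when the
    predictor is missing too, or when no line can be fitted."""
    pairs = [(x, y) for x, y in zip(predictor, target)
             if x is not None and y is not None]
    if len(pairs) < 2:
        return list(target)
    mean_x = sum(x for x, _ in pairs) / len(pairs)
    mean_y = sum(y for _, y in pairs) / len(pairs)
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    if sxx == 0:                      # constant predictor: slope is undefined
        return list(target)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in pairs) / sxx
    intercept = mean_y - slope * mean_x
    return [intercept + slope * x if y is None and x is not None else y
            for x, y in zip(predictor, target)]


def select_features(importances, top_k):
    """Return the names of the top_k features, ranked by importance."""
    ranked = sorted(importances, key=importances.get, reverse=True)
    return ranked[:top_k]


# Toy data: the target is exactly twice the predictor where it is observed.
imputed = regression_impute([2.0, 4.0, None, 8.0], [1.0, 2.0, 3.0, 4.0])
kept = select_features({"a": 0.5, "b": 0.1, "c": 0.4}, top_k=2)
```

A model with several predictors would follow the same pattern, fitting on the rows where every input is observed; the importances would come from the trained baseline model.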