Objective: We aim to predict the onset time of sepsis from time series of physiological measurements summarized hourly for patients in the ICU. The data are extremely sparse: a large proportion of values are missing, mainly because not every lab measurement was collected for every patient each hour. The data are also highly imbalanced, with far fewer sepsis data points than non-sepsis data points, and the length of the time series varies across patients from a few hours to more than 10 days. Motivated by these challenges, we set out to develop machine learning models for early prediction of sepsis status from the physiological time series in this dataset.
Methods: We built recurrent neural network models based on long short-term memory (LSTM) units. First, we combined the samples in data sets A and B and filled all missing values with zeros. We randomly split the data into 75% for training and 25% for testing. For training patients with extended ICU stays, we chopped the time series into shorter segments. To address the class imbalance, we trained LSTMs on random subsets (size 500) of the negative training samples together with a subset (two thirds) of the positive training samples, and validated each model on the remaining samples in the training set. We obtained 100 such models and selected, as the best set of negative training samples, the subset that gave the best validation performance. Finally, we trained a final model on the best set of negative training samples and all positive training samples, and evaluated it on the test set.
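The preprocessing and negative-subset search described above can be sketched as follows. This is only an illustrative outline, not our actual implementation: the function names, the segment length `max_len`, and the `train_and_score` callback (which stands in for training an LSTM and returning its validation utility) are all hypothetical, and we assume here that the two-thirds positive subset is redrawn for every candidate model.

```python
import numpy as np

def zero_fill(series):
    """Replace missing values (NaN) in a (time, features) array with zeros."""
    return np.nan_to_num(series, nan=0.0)

def chop(series, max_len):
    """Split a long time series into consecutive segments of at most max_len hours.
    max_len is an assumed parameter; the text does not state the segment length."""
    return [series[i:i + max_len] for i in range(0, len(series), max_len)]

def select_negative_subset(neg_ids, pos_ids, train_and_score, n_models=100,
                           subset_size=500, pos_fraction=2 / 3, seed=0):
    """Draw n_models random negative subsets and keep the one whose model
    validates best. train_and_score(train_ids, val_ids) is a hypothetical
    callback that trains a model and returns its validation score."""
    rng = np.random.default_rng(seed)
    neg_ids = np.asarray(neg_ids)
    pos_ids = np.asarray(pos_ids)
    best_score, best_neg = -np.inf, None
    for _ in range(n_models):
        # Random negative subset used for training (size 500 in the text).
        size = min(subset_size, len(neg_ids))
        neg_train = rng.choice(neg_ids, size=size, replace=False)
        # Two thirds of the positives for training (assumed redrawn each time).
        pos_perm = rng.permutation(pos_ids)
        n_pos = int(round(pos_fraction * len(pos_ids)))
        pos_train, pos_val = pos_perm[:n_pos], pos_perm[n_pos:]
        # All remaining training samples are used for validation.
        neg_val = np.setdiff1d(neg_ids, neg_train)
        score = train_and_score(np.concatenate([neg_train, pos_train]),
                                np.concatenate([neg_val, pos_val]))
        if score > best_score:
            best_score, best_neg = score, neg_train
    return best_neg, best_score
```

In this sketch, the returned `best_neg` would then be combined with all positive training samples to fit the final model.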
Conclusion: The validation and test utilities are both around 0.5. Moving beyond this “final” model, we are currently implementing a greedy algorithm to further optimize the negative training set.