The Signature-based Model for Early Detection of Sepsis from Electronic Health Records

James Morrill and Andrey Kormilitzin
University of Oxford


Aims: In this study we proposed a novel, signature-based machine learning model to automatically identify a patient's risk of sepsis from physiological data streams and to make a positive or negative prediction of sepsis for every time interval since admission to the intensive care unit. Methods: Mathematically, the signature transformation is a homomorphism from the monoid of paths into the group-like elements of the free tensor algebra. It provides a hierarchical method to succinctly summarise a multidimensional path and encode the longitudinal information without the need for hand-crafted features. The signature terms are canonical representations (features) that can approximate arbitrarily well any function on paths. We framed the problem as predicting a binary sepsis label from the preceding six hours of physiological observations. First, at each time-stamped bin within the window, we computed the probability of sepsis using all physiological data measured within that bin; missing values were imputed using a carry-forward approach. Then we computed the log-signature transform of the data stream within the window, consisting of six specific vital signs: ‘HR’, ‘O2Sat’, ‘SBP’, ‘DBP’, ‘Resp’ and ‘Temp’. The final feature set comprised both the log-signatures and the probabilities and was used as input to a gradient boosting machine (GBM) to predict sepsis within the hour following the six-hour window. The area under the receiver operating characteristic curve (AUC) and a specific utility function (U) created for the Challenge were used to assess the quality of the predictions. Results: Using the signature method with the XGBoost GBM algorithm, we achieved AUC = 0.81 and U = 0.41 under 3-fold cross-validation. Conclusion: The signature method proved to be a valid and direct approach for extracting features from longitudinal physiological data streams without the need for ad hoc feature engineering.
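To make the feature-extraction step in the Methods more concrete, the following is a minimal sketch, not the authors' implementation. It assumes hourly sampling (so a six-hour window has six rows), treats the window as a piecewise-linear path, and truncates the log-signature at depth 2; the abstract does not state the truncation depth, and the function names `forward_fill`, `signature_depth2` and `log_signature_depth2` are hypothetical. In practice a dedicated signature library would likely be used instead of this hand-rolled NumPy version.

```python
import numpy as np


def forward_fill(x):
    """Carry the last observed value forward along time (axis 0).

    Leading NaNs (no earlier observation) are left as NaN.
    """
    out = np.array(x, dtype=float)
    for t in range(1, out.shape[0]):
        missing = np.isnan(out[t])
        out[t, missing] = out[t - 1, missing]
    return out


def signature_depth2(path):
    """Depth-2 signature of a piecewise-linear path of shape (T, d).

    Returns the level-1 terms (shape (d,)) and level-2 terms (shape (d, d)).
    """
    path = np.asarray(path, dtype=float)
    d = path.shape[1]
    s1 = np.zeros(d)
    s2 = np.zeros((d, d))
    for delta in np.diff(path, axis=0):
        # Chen's identity for appending one linear segment: the segment's
        # own level-2 term is delta (x) delta / 2, and the cross term is
        # (level 1 so far) (x) delta.
        s2 += np.outer(s1, delta) + 0.5 * np.outer(delta, delta)
        s1 += delta
    return s1, s2


def log_signature_depth2(path):
    """Depth-2 log-signature: total increments plus pairwise Levy areas."""
    s1, s2 = signature_depth2(path)
    # At depth 2 the tensor logarithm keeps level 1 unchanged and reduces
    # level 2 to its antisymmetric part.
    area = 0.5 * (s2 - s2.T)
    upper = np.triu_indices(len(s1), k=1)
    return np.concatenate([s1, area[upper]])


# Assumed layout: one row per hour, one column per vital sign.
channels = ["HR", "O2Sat", "SBP", "DBP", "Resp", "Temp"]
rng = np.random.default_rng(0)
window = rng.normal(size=(6, len(channels)))  # synthetic data, illustration only
window[2, 1] = np.nan                         # simulate a missing O2Sat reading

features = log_signature_depth2(forward_fill(window))
print(features.shape)
```

With six channels this yields 6 increment terms and 15 area terms, 21 features per window. In the pipeline described above, such a vector would be concatenated with the per-bin sepsis probabilities before being passed to the GBM.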