Early Prediction of Sepsis Using SMOTE and Logistic Regression

Fahim Mahmud and Naqib Sad Pathan


Aims: This study aims to detect sepsis in ICU patients as early as possible from clinical data using Logistic Regression.

Methods: Although the qSOFA score relies on three criteria for sepsis detection, these were insufficient in most cases to predict sepsis early. In addition, the parameters needed for the SOFA score were unavailable most of the time. We therefore focused on other available measurements such as heart rate (HR) and oxygen saturation (O2Sat). Of the nearly 188,000 hours of available data from 5,000 ICU patients, only 1% were labeled as sepsis, indicating a severe class imbalance. Moreover, some of the data were skewed. In the first phase, we spot-checked several algorithms, of which Logistic Regression (LogReg) performed best, with a utility score of 5%. In the second phase, addressing the class imbalance by placing more weight on the sepsis labels raised the utility score to 12%. In the final phase, we first fitted a LogReg model with liblinear as the solver and used Recursive Feature Elimination (RFE) to select the 20 most dominant features with respect to the target variable. One of these 20 features was then eliminated because its p-value exceeded 0.05. The remaining 19 features were used to fit the final model. The resulting coefficients were applied in the logistic function together with a decision threshold, yielding a utility score of 26%.
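The class weighting and thresholding steps described above can be illustrated with a minimal sketch. It is not the implementation used in the study: it uses plain gradient descent instead of the liblinear solver, omits RFE and the p-value filter, and runs on hypothetical toy data with a single feature. The function names, the learning rate, the number of epochs, and both threshold values are assumptions chosen for illustration.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_proba(row, w, b):
    """Probability of the positive (sepsis) class for one feature row."""
    return sigmoid(b + sum(wj * xj for wj, xj in zip(w, row)))


def fit_weighted_logreg(X, y, pos_weight, lr=0.1, epochs=500):
    """Batch gradient descent on a class-weighted log loss.

    Each positive sample counts pos_weight times as much as a negative one.
    """
    n_features = len(X[0])
    w = [0.0] * n_features
    b = 0.0
    total_weight = sum(pos_weight if label == 1 else 1.0 for label in y)
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for row, label in zip(X, y):
            sample_weight = pos_weight if label == 1 else 1.0
            err = sample_weight * (predict_proba(row, w, b) - label)
            for j, xj in enumerate(row):
                grad_w[j] += err * xj
            grad_b += err
        # Simultaneous update of all parameters, normalized by total weight.
        w = [wj - lr * g / total_weight for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / total_weight
    return w, b


def predict_labels(X, w, b, threshold):
    """Flag a row as sepsis when its probability reaches the threshold."""
    return [int(predict_proba(row, w, b) >= threshold) for row in X]


# Hypothetical toy data: one standardized feature, 8 non-sepsis rows, 1 sepsis row.
X = [[-1.0], [-0.8], [-0.5], [-0.2], [0.0], [0.1], [0.3], [0.5], [1.5]]
y = [0, 0, 0, 0, 0, 0, 0, 0, 1]

# Weight the rare class by the negative-to-positive ratio (an assumed choice).
pos_weight = y.count(0) / y.count(1)  # 8.0
w, b = fit_weighted_logreg(X, y, pos_weight)

# A lower threshold can only flag the same rows or more.
strict = predict_labels(X, w, b, threshold=0.7)
lenient = predict_labels(X, w, b, threshold=0.3)
```

In practice, the threshold would be tuned against the utility score on held-out data rather than fixed in advance.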

Observations: The most crucial aspects of the dataset are the heavy class imbalance and the large number of missing values. Focusing on these two aspects should improve performance. Techniques such as bagging, boosting, and other ensemble and hybrid methods might be used to handle the class imbalance, while multiple imputation and maximum likelihood methods could be applied to the missing-data problem.