Deep Feature Learning for Early Disease Prediction

Jia Yao1, Ming Lun Ong2, Kar Kin Mun1, Shiyu Liu1, Mehul Motani1
1NUS/ECE, 2National University of Singapore


Artificial intelligence (AI) based predictive diagnosis is enabled by high-quality patient health data. However, the heterogeneity of patient data and the lack of efficient feature extraction methods limit the effectiveness of AI techniques in healthcare. Medical measurements (e.g., vital signs) are recorded at varying sampling intervals and often have missing observations, which makes patient data highly irregular in structure and complicates data processing and information extraction. Moreover, discovering and learning latent information from patient health data is critical for clinical decision making and predictive diagnosis, yet traditional feature extraction methods are not effective enough for these applications. Manual feature selection relies on expert domain knowledge to look for patterns in an ad hoc manner. Supervised feature extraction requires a large volume of labelled data, which is costly and time-consuming to obtain, and the extracted features do not always generalize well. To overcome these challenges, we propose a deep feature learning (DFL) framework that learns feature representations in an unsupervised manner. DFL automatically learns compact representations of patient data via stacked autoencoders, which are designed to capture both spatial and temporal features in the data and thereby support effective prediction. We report cross-validation results on publicly available EEG data (from the UCI Machine Learning Repository) and on the publicly available Physionet 2019 training dataset. On the EEG dataset, DFL classifies patient diagnoses from EEG recordings with an accuracy of 0.99 and an F1-score of 0.99. The Physionet 2019 sepsis dataset contains missing data, which we impute with Gaussian process regression. Our initial experiments on sepsis prediction give an accuracy of 0.7 and an F1-score of 0.8. We expect to improve the results on the Physionet dataset in the coming months.
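
To make the imputation step concrete, the following is a minimal sketch of how missing observations in a single vital-sign series could be filled in with the posterior mean of a Gaussian process. It is an illustration under stated assumptions, not the configuration used in this work: the squared-exponential kernel, the `length_scale` and `noise` values, the function names (`rbf`, `solve`, `gp_impute`), and the example heart-rate series are all hypothetical.

```python
import math

def rbf(t1, t2, length_scale=2.0, variance=1.0):
    """Squared-exponential kernel between two time stamps (assumed kernel)."""
    return variance * math.exp(-0.5 * ((t1 - t2) / length_scale) ** 2)

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

def gp_impute(times, values, noise=1e-2, length_scale=2.0):
    """Replace None entries in `values` with the GP posterior mean.

    Observed entries are returned unchanged. The prior mean is the
    average of the observed values (an assumption of this sketch).
    """
    obs = [(t, v) for t, v in zip(times, values) if v is not None]
    if not obs:
        return list(values)  # nothing to condition on
    mean = sum(v for _, v in obs) / len(obs)
    # Kernel matrix over observed time stamps, with noise on the diagonal.
    K = [[rbf(ti, tj, length_scale) + (noise if i == j else 0.0)
          for j, (tj, _) in enumerate(obs)]
         for i, (ti, _) in enumerate(obs)]
    alpha = solve(K, [v - mean for _, v in obs])
    filled = []
    for t, v in zip(times, values):
        if v is None:
            k_star = [rbf(t, t_obs, length_scale) for t_obs, _ in obs]
            v = mean + sum(k * a for k, a in zip(k_star, alpha))
        filled.append(v)
    return filled

# Hypothetical hourly heart-rate series with two missing readings.
hours = [0, 1, 2, 3, 4, 5, 6]
heart_rate = [80.0, 82.0, None, 86.0, None, 90.0, 92.0]
imputed = gp_impute(hours, heart_rate)
print(imputed)
```

In practice the kernel hyperparameters would be fitted to the data (for example by maximizing the marginal likelihood), and the posterior variance could be retained as a measure of imputation uncertainty; both are omitted here for brevity.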