Icentia11K: An Unsupervised Representation Learning Dataset for Arrhythmia Subtype Discovery

Shawn Tan1, Guillaume Androz2, Satya Ortiz-Gagné3, Ahmad Chamseddine4, Pierre Fecteau2, Aaron Courville1, Yoshua Bengio1, Joseph Paul Cohen5
1University of Montreal, 2Icentia, 3Mila, 4Polytechnique Montreal, 5Stanford University


We release a public electrocardiogram (ECG) dataset of continuous raw signals for representation learning, containing 11 thousand patients and 2 billion labelled beats. The signals were recorded at 16-bit resolution and 250 Hz with a fixed, chest-mounted, single-lead probe for up to 2 weeks. The average patient age is 62.2±17.4 years. Twenty technologists annotated each beat's type (Normal, Premature Atrial Contraction, Premature Ventricular Contraction) and rhythm (Normal Sinus Rhythm, Atrial Fibrillation, Atrial Flutter).
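To give a sense of scale, the recording parameters stated above imply the following upper bound on the size of one uncompressed recording. This is a back-of-the-envelope sketch: the function name `recording_size` is illustrative, and the calculation assumes a full 14-day recording stored as raw 16-bit samples with no compression or container overhead, which may not match how the released files are stored.

```python
SAMPLE_RATE_HZ = 250   # sampling rate stated in the text
BITS_PER_SAMPLE = 16   # resolution stated in the text
MAX_DAYS = 14          # "up to 2 weeks"


def recording_size(days, sample_rate_hz=SAMPLE_RATE_HZ, bits_per_sample=BITS_PER_SAMPLE):
    """Return (sample_count, size_in_bytes) for one uncompressed single-lead recording."""
    seconds = days * 24 * 60 * 60
    samples = seconds * sample_rate_hz
    size_bytes = samples * bits_per_sample // 8
    return samples, size_bytes


samples, size_bytes = recording_size(MAX_DAYS)
print(f"{samples:,} samples, {size_bytes / 1e6:.1f} MB")
# prints "302,400,000 samples, 604.8 MB"
```

Shorter recordings scale linearly, so a one-day recording under the same assumptions holds 21,600,000 samples.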

To analyse this data, we evaluate existing supervised classification methods, aiming to replicate their results. We also explore unsupervised representation learning methods, both to improve classification performance when few labelled samples are available and to identify arrhythmia subtypes. We present a semi-supervised framework for evaluating the quality of representation learning methods.
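The general idea of a low-label evaluation can be sketched as follows. This is only an illustration and not the framework presented in this paper: it assumes embeddings have already been produced by some frozen encoder, swaps in a simple nearest-centroid classifier, and uses synthetic data in place of real ECG representations. The names `nearest_centroid_accuracy` and `low_label_evaluation` are hypothetical.

```python
import numpy as np


def nearest_centroid_accuracy(train_x, train_y, test_x, test_y):
    """Fit one centroid per class on the labelled embeddings, then score the test set."""
    classes = np.unique(train_y)
    centroids = np.stack([train_x[train_y == c].mean(axis=0) for c in classes])
    # Squared Euclidean distance from every test embedding to every centroid.
    dists = ((test_x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    predictions = classes[dists.argmin(axis=1)]
    return float((predictions == test_y).mean())


def low_label_evaluation(embeddings, labels, n_labelled_per_class, rng):
    """Fit on a few labelled embeddings per class and test on all remaining ones."""
    train_idx = []
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        train_idx.extend(rng.choice(idx, size=n_labelled_per_class, replace=False))
    train_idx = np.array(train_idx)
    test_mask = np.ones(len(labels), dtype=bool)
    test_mask[train_idx] = False
    return nearest_centroid_accuracy(
        embeddings[train_idx], labels[train_idx],
        embeddings[test_mask], labels[test_mask],
    )


# Synthetic stand-in for encoder outputs (assumption): three well-separated clusters.
rng = np.random.default_rng(0)
labels = np.repeat(np.arange(3), 100)
embeddings = rng.normal(size=(300, 8)) + 10.0 * labels[:, None]
accuracy = low_label_evaluation(embeddings, labels, n_labelled_per_class=5, rng=rng)
print(f"accuracy with 5 labels per class: {accuracy:.2f}")
```

Under this kind of protocol, a better representation is one that reaches higher held-out accuracy for the same small label budget, so varying `n_labelled_per_class` traces out how quickly a representation becomes useful.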

We achieve over 80\% accuracy on beat and rhythm classification tasks using supervised models trained on large numbers of samples. In a low-data setting, these supervised methods perform considerably worse (around 40\% accuracy), and the semi-supervised methods we explore improve performance only slightly. This presents an open challenge to develop better ECG representation learning algorithms, and the dataset we release is well suited to developing such methods.