RandomSCM: interpretable ensembles of sparse classifiers tailored for omics data

Thibaud Godon, Pier-Luc Plante, Baptiste Bauvin, Élina Francovic-Fontaine, Alexandre Drouin, François Laviolette, Jacques Corbeil

July 2021

Abstract

Recent metabolomics measurement devices, such as mass spectrometers, produce extremely high-dimensional data. Combined with small sample sizes, this setting is known as the fat data (or p ≫ n) problem. Biomarker discovery in this configuration is challenging: classical statistical methods fail, and common Machine Learning (ML) algorithms produce models too complex to be interpretable. ML algorithms that rely on sparsity to predict phenotypes using very few covariates have been shown to thrive in this setting. Sparsity not only helps to avoid overfitting, it also leads to concise models that are easier to interpret for biomarker discovery.

The Set Covering Machine (SCM) algorithm produces sparse models based on simple decision rules. Recent work has applied SCMs to the genotype-to-phenotype prediction of antibiotic resistance and achieved state-of-the-art accuracy. To adapt this approach to metabolomics (fat) data, we developed a bootstrap aggregation of SCM models: RandomSCM.

We explored applications of RandomSCM beyond genotype-to-phenotype prediction by applying it to five metabolomics datasets. Prediction performance is at a state-of-the-art level. Furthermore, studying the decision rules in RandomSCM revealed valid biomarkers of the phenotypes. These results demonstrate the high potential of the RandomSCM algorithm for biomarker discovery in omics sciences.
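To make the idea of a bootstrap aggregation of sparse rule-based models concrete, the following is a minimal sketch and not the authors' implementation. It assumes a simplified stand-in for an SCM (a greedy conjunction of at most two threshold rules), plain bagging with majority voting, and a toy dataset with 12 samples and 30 features in which only feature 0 is informative. All function names, parameters, and data here are hypothetical and chosen for illustration only.

```python
import random
from collections import Counter


def fit_conjunction(X, y, max_rules=2):
    """Greedily build a conjunction of threshold rules (simplified SCM stand-in)."""
    rules = []                      # each rule: (feature index, threshold, keep_if_greater)
    active = list(range(len(X)))    # samples the conjunction still labels positive
    for _ in range(max_rules):
        best = None
        for j in range(len(X[0])):
            for t in sorted({X[i][j] for i in active}):
                for greater in (True, False):
                    removed = [i for i in active if (X[i][j] > t) != greater]
                    neg_removed = sum(1 for i in removed if y[i] == 0)
                    pos_removed = len(removed) - neg_removed
                    # Assumed utility: negatives excluded minus positives lost.
                    score = neg_removed - pos_removed
                    if best is None or score > best[0]:
                        best = (score, (j, t, greater))
        if best is None or best[0] <= 0:
            break                   # no rule improves the model: stay sparse
        j, t, greater = best[1]
        rules.append(best[1])
        active = [i for i in active if (X[i][j] > t) == greater]
    return rules


def predict_conjunction(rules, x):
    """Predict 1 only if the sample satisfies every rule of the conjunction."""
    return int(all((x[j] > t) == greater for j, t, greater in rules))


def fit_ensemble(X, y, n_estimators=25, max_rules=2, seed=0):
    """Fit one sparse model per bootstrap sample (rows drawn with replacement)."""
    rng = random.Random(seed)
    n = len(X)
    ensemble = []
    for _ in range(n_estimators):
        rows = [rng.randrange(n) for _ in range(n)]
        Xb = [X[i] for i in rows]
        yb = [y[i] for i in rows]
        ensemble.append(fit_conjunction(Xb, yb, max_rules))
    return ensemble


def predict_ensemble(ensemble, x):
    """Majority vote over the sparse models."""
    votes = sum(predict_conjunction(rules, x) for rules in ensemble)
    return int(2 * votes > len(ensemble))


def feature_counts(ensemble):
    """Count how often each feature is used in a rule across the ensemble."""
    return Counter(j for rules in ensemble for j, _, _ in rules)


# Toy "fat" data (hypothetical): 12 samples, 30 features, only feature 0 informative.
data_rng = random.Random(1)
y = [1, 0] * 6
X = [[0.9 if label else 0.1] + [data_rng.random() for _ in range(29)] for label in y]

ensemble = fit_ensemble(X, y)
print(feature_counts(ensemble).most_common(3))
```

Counting how often each feature appears in the rules of the ensemble gives a simple ranking of candidate biomarkers, which is the kind of inspection the abstract refers to when it mentions studying the decision rules. In this sketch each base model stays sparse because rule selection stops as soon as no rule has positive utility.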

Type

Conference paper

Publication

Intelligent Systems for Molecular Biology and European Conference on Computational Biology

Alexandre Drouin

Head of Frontier AI Research

Head of Frontier AI Research, based in Montreal, QC, Canada.
