Recent metabolomics measurement devices, such as mass spectrometers, produce extremely high-dimensional data. Combined with small sample sizes, this gives rise to the fat data (or p ≫ n) problem, in which the number of features p far exceeds the number of samples n. Biomarker discovery is challenging in this setting: classical statistical methods fail, and common machine learning (ML) algorithms produce models that are too complex to interpret. ML algorithms that rely on sparsity to predict phenotypes from very few covariates have been shown to thrive in this setting. Sparsity not only helps avoid overfitting but also yields concise models that are easier to interpret for biomarker discovery.
The Set Covering Machine (SCM) algorithm produces sparse models based on simple decision rules. Recent work has applied SCMs to the genotype-to-phenotype prediction of antibiotic resistance and achieved state-of-the-art accuracy. To adapt this approach to metabolomics (fat) data, we developed a bootstrap aggregation of SCM models: RandomSCM.
We explored applications of RandomSCM beyond genotype-to-phenotype prediction by applying it to five metabolomics datasets. Its predictive performance is on par with the state of the art. Furthermore, studying the decision rules learned by RandomSCM revealed valid biomarkers of the phenotypes. These results demonstrate the high potential of the RandomSCM algorithm for biomarker discovery in omics sciences.
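
To make the idea of bootstrap-aggregating sparse, rule-based models concrete, the following is a minimal sketch and not the authors' implementation. It assumes a simplified SCM-style learner that greedily builds a conjunction of threshold rules of the form "feature > t" or "feature <= t", and it counts how often each feature appears in a rule as a rough indicator of candidate biomarkers. All function names, the trade-off parameter `p`, the hyperparameter values, and the synthetic data are hypothetical and chosen only for illustration.

```python
import numpy as np

def _apply(rule, X):
    """Evaluate one threshold rule (feature, threshold, sign) on X."""
    j, t, sign = rule
    return X[:, j] > t if sign == 1 else X[:, j] <= t

def fit_conjunction(X, y, max_rules=3, p=1.0):
    """Greedy SCM-style conjunction (simplified, illustrative only).

    Each selected rule should reject many remaining negatives while
    rejecting few remaining positives; p weights the second term.
    """
    remaining = np.ones(len(y), dtype=bool)
    rules = []
    for _ in range(max_rules):
        neg = remaining & (y == 0)
        pos = remaining & (y == 1)
        if not neg.any():
            break  # every negative is already rejected by some rule
        best, best_score = None, 0.0
        for j in range(X.shape[1]):
            for t in np.unique(X[remaining, j]):
                for sign in (1, -1):
                    out = _apply((j, t, sign), X)
                    score = np.sum(neg & ~out) - p * np.sum(pos & ~out)
                    if score > best_score:
                        best, best_score = (j, t, sign), score
        if best is None:
            break  # no rule improves the conjunction
        rules.append(best)
        remaining &= _apply(best, X)  # drop examples rejected by the rule
    return rules

def predict_conjunction(rules, X):
    """Predict 1 only if every rule in the conjunction holds."""
    pred = np.ones(len(X), dtype=bool)
    for rule in rules:
        pred &= _apply(rule, X)
    return pred.astype(int)

def fit_bagged_rules(X, y, n_estimators=15, seed=0):
    """Fit one conjunction per bootstrap resample of the training set."""
    rng = np.random.default_rng(seed)
    n = len(y)
    ensemble = []
    for _ in range(n_estimators):
        rows = rng.integers(0, n, size=n)  # sample rows with replacement
        ensemble.append(fit_conjunction(X[rows], y[rows]))
    return ensemble

def predict_bagged_rules(ensemble, X):
    """Majority vote over the conjunctions in the ensemble."""
    votes = np.mean([predict_conjunction(r, X) for r in ensemble], axis=0)
    return (votes >= 0.5).astype(int)

def rule_counts(ensemble, n_features):
    """Count how often each feature is used in a rule across the ensemble."""
    counts = np.zeros(n_features, dtype=int)
    for rules in ensemble:
        for j, _, _ in rules:
            counts[j] += 1
    return counts

# Synthetic "fat" data: 40 samples, 60 features, only feature 7 is informative.
rng = np.random.default_rng(42)
X = rng.normal(size=(40, 60))
y = (X[:, 7] > 0).astype(int)

ensemble = fit_bagged_rules(X, y)
accuracy = np.mean(predict_bagged_rules(ensemble, X) == y)
counts = rule_counts(ensemble, X.shape[1])
top_feature = int(np.argmax(counts))
```

In this toy setting, the feature that appears most often in the learned rules is the one that drives the phenotype, which mirrors how inspecting decision rules can point to candidate biomarkers. The exhaustive rule search is kept deliberately simple and would need to be made more efficient for real metabolomics data.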