Ensemble of Heterogeneous Learners for Genomic Classification

Giordano, M; Guarracino, MR; Tripathi, KP

The sbv IMPROVER project published a call for participation in the SysTox Computation Challenge [1]. Participants were asked to develop models that classify subjects first as smokers versus non-current smokers, and then as former smokers versus never smokers, using whole blood gene expression data from human, or from human and rodent. The first condition of the challenge was that the proposed models be inductive: once a model has been developed on the training data, each test sample must be classified with that model alone, without retraining. Inductive models contrast with transductive models, in which the training and test sets are processed together and used to retrain the model before prediction. A second rule was that the classification models rely only on a small subset of genes (fewer than 40) selected from the whole blood gene expression data.
In this paper we approach the SysTox challenge tasks with a three-step methodology: 1) using cross-validation on the training data, we search for the most relevant gene expressions (out of thousands of genes) by ranking them according to their weights in an SVM classifier [9]; 2) we evaluate a set of heterogeneous ML methods (from the Scikit-Learn toolset [2]) in cross-validation on the training data to identify the most effective learners, considering only the expressions of the most relevant genes (as ranked in the previous step) and varying the gene signature size from ten to one hundred; 3) we build a meta-learner consisting of a heterogeneous ensemble of the most effective learners; during prediction, each trained learner produces an independent response for every test sample, and the response of the ensemble is computed by soft voting over the responses of its components. Experimental results show that the proposed ensemble learning method is a viable and competitive approach to classification in the genomic domain. The biological interpretation of the obtained gene signatures was carried out by functional annotation analysis using DAVID web services [3], while the possible role of these genes in, or their association with, the effects of smoking was analyzed using the Comparative Toxicogenomics Database [4].
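The three steps can be illustrated with a minimal Scikit-Learn sketch. It is not the authors' implementation: the synthetic data, the helper `rank_genes`, the candidate learners, their hyperparameters, the signature sizes tried, and the choice of three ensemble members are all assumptions made for illustration. The model is fitted once on the training split and then applied to the held-out samples without retraining, in line with the inductive constraint.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Synthetic stand-in for an expression matrix (samples x genes); assumption.
X, y = make_classification(n_samples=200, n_features=500, n_informative=20,
                           random_state=0)
X_train, y_train, X_test, y_test = X[:150], y[:150], X[150:], y[150:]

# Scaling parameters come from the training data only (inductive setting).
scaler = StandardScaler().fit(X_train)
X_train_s, X_test_s = scaler.transform(X_train), scaler.transform(X_test)


def rank_genes(X, y, n_splits=5, seed=0):
    """Step 1: rank features by absolute linear-SVM weight summed over CV folds."""
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    weights = np.zeros(X.shape[1])
    for fit_idx, _ in cv.split(X, y):
        svm = LinearSVC(C=0.1, max_iter=10000).fit(X[fit_idx], y[fit_idx])
        weights += np.abs(svm.coef_[0])
    return np.argsort(weights)[::-1]  # most relevant feature first


ranking = rank_genes(X_train_s, y_train)

# Step 2: score heterogeneous learners in CV for several signature sizes.
# Every candidate exposes predict_proba, which soft voting requires.
candidates = {
    "logreg": LogisticRegression(max_iter=1000),
    "forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "knn": KNeighborsClassifier(n_neighbors=5),
    "nb": GaussianNB(),
}
signature_sizes = (10, 20, 40)  # illustrative grid, not the paper's
scores = {
    (name, size): cross_val_score(model, X_train_s[:, ranking[:size]],
                                  y_train, cv=5).mean()
    for size in signature_sizes
    for name, model in candidates.items()
}

# Keep the size with the best mean score, then the three best learners for it.
best_size = max(signature_sizes,
                key=lambda s: np.mean([scores[(n, s)] for n in candidates]))
signature = ranking[:best_size]
top_names = sorted(candidates, key=lambda n: scores[(n, best_size)],
                   reverse=True)[:3]

# Step 3: soft-voting ensemble, fitted once and reused for all test samples.
ensemble = VotingClassifier([(n, candidates[n]) for n in top_names],
                            voting="soft")
ensemble.fit(X_train_s[:, signature], y_train)
proba = ensemble.predict_proba(X_test_s[:, signature])
accuracy = ensemble.score(X_test_s[:, signature], y_test)
print(f"signature size={best_size}, members={top_names}, accuracy={accuracy:.2f}")
```

In this sketch the ranking is computed on the same training data that is later used to score the learners, so the cross-validated scores are optimistic; nesting the ranking inside each fold would give a less biased estimate.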

CNR Institutional Research Information System