Analyzing gene expression data for cancer diagnosis and prognosis using Logic Learning Machine and standard supervised methods

Verda, D; Parodi, S; Ferrari, E; Muselli, M

Motivation: Logic Learning Machine (LLM) is an innovative method of supervised data mining based on an efficient implementation of the Switching Neural Network model. Its advantage over most traditional methods of supervised data analysis is its ability to identify simple, intelligible rules with potential diagnostic and prognostic applications. In particular, LLM was recently applied to extract a few highly discriminant rules from a signature of hypoxia-related genes for the prognosis of neuroblastoma, a highly fatal childhood cancer. In that analysis LLM outperformed many standard machine learning methods. Furthermore, analyses of both simulated and real data sets have shown that LLM can exploit the complex correlation structure of high-dimensional gene expression data for feature selection and can combine clinical features with gene expression for classification. These results indicate that LLM could be a powerful and flexible new tool for the analysis of gene expression data in oncology. However, its accuracy as a classifier when applied to a large set of gene expression databases remains to be assessed.

Methods: LLM was applied to a large set of publicly available gene expression microarray databases stored in the GEO repository (http://www.ncbi.nlm.nih.gov/gds/). Selection criteria were: a) inclusion in GEO from January 2010 to December 2014; b) presence of at least two classes potentially useful for cancer diagnosis or prognosis, each including at least 20 samples; c) availability of a scientific paper in English, available on PubMed, fully describing the experiment and the related study design. The performance of LLM was compared with that of four competing supervised learning methods: Decision Tree (DT), Artificial Neural Network (ANN), Support Vector Machine (SVM), and k-Nearest Neighbor classifier (kNN).
To control overfitting bias, the comparison was carried out in leave-one-out cross-validation (LOOCV). The accuracy of each classifier was evaluated with summary Receiver Operating Characteristic (sROC) curves, and a global measure of accuracy was obtained from the area under the corresponding sROC curve (sAUC). Ninety-five percent confidence intervals (95% CI) of sAUC were obtained from the variance of the corresponding summary Diagnostic Odds Ratio (sDOR) by exploiting the relation between sAUC and sDOR in a proper ROC model.

Results: Fifty-two data sets were retrieved from the GEO web site. After careful examination of their content and related documentation, 27 were excluded because they did not fully comply with the selection criteria, leaving 25 data sets available for analysis, corresponding to 37 comparisons (33 two-class and four multi-class). In more detail, these included eight comparisons between diagnostic variables, seven between prognostic factors (death or cancer relapse), and 22 related to variables allegedly associated with prognostic factors (namely tumor grading, stage at diagnosis, and occurrence of some specific mutations). In diagnostic comparisons SVM and LLM clearly outperformed the other methods, with similar and very high accuracy in LOOCV (sAUC = 0.95, 95% CI: 0.91 - 0.98, and sAUC = 0.91, 95% CI: 0.85 - 0.94, respectively) (see Fig. 1). ANN and kNN showed poor accuracy on average (sAUC = 0.56, 95% CI: 0.48 - 0.63, and sAUC = 0.64, 95% CI: 0.54 - 0.74, respectively). In both the prognostic comparisons and the remaining comparisons SVM showed the highest accuracy (AUC = 0.73, 95% CI: 0.67 - 0.78, and AUC = 0.70, 95% CI: 0.67 - 0.74, respectively), whereas all the other methods, including LLM, performed poorly.
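
The abstract does not state which relation between sAUC and sDOR was used. As an illustration only, the sketch below assumes the well-known closed form for a symmetric sROC curve with constant odds ratio, AUC = DOR / (DOR - 1)^2 * [(DOR - 1) - ln(DOR)], and derives a 95% CI by transforming the limits of a normal-approximation interval for ln(sDOR). The function names and the input values are hypothetical and are not taken from the study.

```python
import math


def auc_from_dor(dor: float) -> float:
    """Area under a symmetric sROC curve with constant diagnostic odds ratio.

    Assumed closed form (not confirmed by the abstract):
        AUC = DOR / (DOR - 1)^2 * [(DOR - 1) - ln(DOR)]
    """
    if dor <= 0:
        raise ValueError("the diagnostic odds ratio must be positive")
    x = dor - 1.0
    if abs(x) < 1e-6:
        # Limit of the formula as DOR -> 1 (a non-informative classifier).
        return 0.5
    # log1p(x) = ln(DOR); it is more accurate than log(dor) when DOR is near 1.
    return dor / x ** 2 * (x - math.log1p(x))


def sauc_with_ci(log_sdor: float, se_log_sdor: float, z: float = 1.959964):
    """Return (sAUC, lower, upper) from ln(sDOR) and its standard error.

    The AUC is a monotonically increasing function of the DOR, so the limits
    of the interval for ln(sDOR) can be mapped directly onto the sAUC scale.
    """
    lower = log_sdor - z * se_log_sdor
    upper = log_sdor + z * se_log_sdor
    return tuple(auc_from_dor(math.exp(v)) for v in (log_sdor, lower, upper))


# Hypothetical inputs, chosen only to show the call pattern.
estimate, ci_low, ci_high = sauc_with_ci(log_sdor=math.log(9.0), se_log_sdor=0.3)
print(f"sAUC = {estimate:.2f}, 95% CI: {ci_low:.2f} - {ci_high:.2f}")
```

Working on the log odds ratio scale keeps the interval inside (0, 1) after transformation, which a symmetric interval computed directly on the sAUC scale would not guarantee.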

CNR Institutional Research Information System