"Noisy beets": Impact of phenotyping errors on genomic predictions for binary traits in Beta vulgaris

Biscarini F. (first author); Broccanello C.
2016

Abstract

Background: Noise (errors) in scientific data is endemic and may have a detrimental effect on statistical analyses and experimental results. The effects of noisy data have been assessed in genome-wide association studies for case-control experiments in human medicine. Little is known, however, about the impact of noisy data on genomic predictions, a widely used statistical application in plant and animal breeding.

Results: This study investigated the sensitivity to noisy data of five classification methods: K-nearest neighbours (KNN), random forest (RF), ridge logistic regression (LR), and support vector machines (SVM) with linear or radial basis function kernels. A sugar beet population of 123 plants, phenotyped for a binary trait and genotyped for 192 SNP (single nucleotide polymorphism) markers, was used. Labels (0/1 phenotype) were randomly sampled to generate noise. Starting from the base scenario without errors in the labels, increasing proportions of noisy labels, up to 50%, were generated and introduced into the data.

Conclusions: Local classification methods (KNN and RF) showed higher tolerance to noisy labels than methods that leverage global data properties (LR and the two SVM models). In particular, KNN outperformed all other classifiers, with an AUC (area under the ROC curve) higher than 0.95 with up to 20% noisy labels. The runner-up method, RF, had an AUC of 0.941 with 20% noise.
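The noise-injection design described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' code. The simulated genotypes, the toy phenotype rule, the Hamming-distance KNN, the train/test split, and every function name below are assumptions made for illustration. Only the population size (123 plants), the number of markers (192), and the noise range (0 to 50%) come from the abstract.

```python
import random

def add_label_noise(labels, proportion, rng):
    """Return a copy of `labels` in which the given proportion of entries
    is replaced by a label drawn at random from {0, 1}."""
    noisy = list(labels)
    n_noisy = round(proportion * len(noisy))
    for i in rng.sample(range(len(noisy)), n_noisy):
        noisy[i] = rng.randint(0, 1)  # the resampled label may equal the original
    return noisy

def auc(y_true, scores):
    """Area under the ROC curve, computed as the fraction of
    (positive, negative) pairs ranked correctly; ties count as 0.5."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def knn_scores(train_x, train_y, test_x, k=5):
    """Score each test sample by the fraction of its k nearest training
    samples (Hamming distance on genotype codes) that are labelled 1."""
    scores = []
    for x in test_x:
        dist = sorted(
            (sum(a != b for a, b in zip(x, t)), y)
            for t, y in zip(train_x, train_y)
        )
        scores.append(sum(y for _, y in dist[:k]) / k)
    return scores

def demo(seed=42):
    """Train on increasingly noisy labels, evaluate on clean labels."""
    rng = random.Random(seed)
    n_plants, n_snps = 123, 192  # sizes from the abstract; data are simulated
    genotypes = [[rng.choice((0, 1, 2)) for _ in range(n_snps)]
                 for _ in range(n_plants)]
    # Hypothetical phenotype rule: driven by the first 10 markers only.
    phenotype = [int(sum(g[:10]) > 10) for g in genotypes]
    split = 82  # assumed split: 82 training plants, 41 test plants
    for noise in (0.0, 0.1, 0.2, 0.3, 0.4, 0.5):
        noisy_train = add_label_noise(phenotype[:split], noise, rng)
        scores = knn_scores(genotypes[:split], noisy_train, genotypes[split:])
        print(f"noise={noise:.0%}  AUC={auc(phenotype[split:], scores):.3f}")

if __name__ == "__main__":
    demo()
```

Because noisy labels are drawn at random from {0, 1}, roughly half of them coincide with the true label, so the share of labels that actually change is lower than the nominal noise proportion. The sketch applies noise to the training labels only and scores predictions against clean test labels. Whether the study evaluated in the same way is not stated in the abstract.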
Istituto di Biologia e Biotecnologia Agraria - IBBA
Binomial phenotype
Classification
Genomic predictions
K-nearest neighbours (KNN)
Noisy data
Random forest (RF)
Ridge logistic regression
Robustness to errors
Sugar beet
Support vector machines (SVM)

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14243/500182