The effect of mislabeled phenotypic status on the identification of mutation-carriers from SNP genotypes in dairy cattle
Stefano Biffani;Filippo Biscarini
2017
Abstract
Statistical and machine learning applications are increasingly popular in animal breeding and genetics, especially to compute genomic predictions for phenotypes of interest. Noise (errors) in the data may have a negative impact on the accuracy of predictions. The effects of noisy data have been investigated in genome-wide association studies for case-control experiments, and in genomic predictions for binary traits in plants. No studies have been published yet on the impact of noisy data in animal genomics. In this work, the susceptibility to noise was tested for five classification models: Lasso-penalised logistic regression (Lasso), K-nearest neighbours (KNN), random forest (RF), and support vector machines with a linear (SVML) or radial (SVMR) kernel. As an illustration, the identification of carriers of a recessive mutation in cattle (Bos taurus) was used. A population of 3116 Fleckvieh animals with SNP genotypes on the same chromosome as the mutation locus (BTA 19) was available. The carrier status (0/1 phenotype) was randomly sampled to generate noise. Increasing proportions of noise, up to 20%, were introduced in the data.
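The noise-generation step described in the abstract can be illustrated with a minimal sketch. It assumes that noise means flipping the 0/1 carrier status of a randomly chosen fraction of animals; the abstract does not specify the exact sampling scheme, so this is one plausible reading and not the authors' procedure. The function name `add_label_noise`, the simulated carrier frequency (10%), and the intermediate noise levels are hypothetical choices made for illustration only.

```python
import random


def add_label_noise(labels, noise_fraction, seed=0):
    """Return a copy of 0/1 `labels` with a random fraction of entries flipped.

    Assumption: "noise" is modelled as flipping carrier status (0 <-> 1).
    """
    if not 0.0 <= noise_fraction <= 1.0:
        raise ValueError("noise_fraction must be in [0, 1]")
    rng = random.Random(seed)
    n_flip = round(noise_fraction * len(labels))
    # Sample distinct positions so exactly n_flip labels change.
    flip_idx = rng.sample(range(len(labels)), n_flip)
    noisy = list(labels)
    for i in flip_idx:
        noisy[i] = 1 - noisy[i]
    return noisy


if __name__ == "__main__":
    # Simulated phenotypes for 3116 animals; the 10% carrier frequency
    # is a made-up value, not a figure from the study.
    rng = random.Random(42)
    carrier_status = [1 if rng.random() < 0.10 else 0 for _ in range(3116)]

    # Increasing noise levels up to 20% (intermediate levels are assumed).
    for fraction in (0.0, 0.05, 0.10, 0.20):
        noisy = add_label_noise(carrier_status, fraction, seed=1)
        n_changed = sum(a != b for a, b in zip(carrier_status, noisy))
        print(f"noise={fraction:.2f}: {n_changed} labels changed")
```

In a setting like the one described, each noisy label vector would then be used to train the classifiers, with predictions compared against the original, uncorrupted carrier status.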