Unravelling the genetics of complex diseases through random forest: application to genome-wide association of asthma in a genetic isolate of Ogliastra

Pirastu, N; Cabras, S; Casula, L; Castellanos, ME; Persico, I; Sassu, A; Biino, G; Del Giacco, SR; Pirastu, M

To date, association studies between genetic variants and common complex diseases require the examination of a very large number of samples in order to identify variants with a limited weight in the onset of the disease. Their success has been limited compared with what was expected. One of the problems could be the lack of statistical methods able to capture the complexity of biological mechanisms and their interaction with the environment and with different lifestyles. In our work we propose a procedure based on ensemble methods which addresses several problems linked to association studies: false associations due to linkage disequilibrium (LD), multiple testing, elicitation of a genetic mode of inheritance, and the detection and definition of variant/variant and variant/environment interactions.
We apply this procedure to a complex disease (asthma) in a small village located in Ogliastra, a secluded area of Sardinia. First we identified 57 asthma cases through a large screening of the whole village of Talana. The clinical study was carried out by specialized physicians according to international guidelines and was based on the ECRHS short screening questionnaire, spirometry, nitric oxide breath test, skin tests with allergens, and measurement of IgE in serum. As potential controls we chose people who were negative for all the parameters used for asthma diagnosis and who were not under anti-asthmatic therapy. We identified 191 such controls. One of the main problems of association studies in genetic isolates is the non-independence of the subjects included in the analysis, whereas independence is a prerequisite for the application of most statistical techniques. This may lead to "population stratification" effects caused by differences in relatedness between the group of cases and the group of controls. Different approaches have been proposed to avoid this problem. However, these techniques are either too conservative (Genomic Control [3]) or are applicable only to certain kinds of statistical tests.
For this reason we selected for each case the most related control; in this way we should both avoid population stratification and reduce the number of false positive variants. Indeed, regions shared identical by descent (IBD) between each case and its control that are not linked to the disease should prevent false associations due to chance. In order to find the solution that maximizes the overall kinship between matched subjects we used the Hungarian method [4], which is commonly used to find the best solution in assignment problems. In this way we selected 57 control samples matched to the 57 cases identified through the screening. All of the subjects had previously been genotyped with the Affymetrix GeneChip Human Mapping 500K Array. We also included non-genetic variables possibly related to asthma, such as sex, smoking and physical activity. In order to verify whether this approach succeeded in avoiding population stratification in our sample, we performed Fisher's test of dependence between each marker and disease status. We then calculated the lambda coefficient, which measures the compatibility of the p-values with the uniformity assumption according to the genomic control method [3]. The calculated lambda was equal to 0.3, indicating not only that there was no population stratification between cases and controls, but also that genome wide the p-values were higher than expected. Therefore, we can assume that the p-values are less significant in loci which are not related to the disease, while they should be unaffected in loci associated with asthma. The identification of genes involved in complex diseases involves essentially two main concerns: variable selection and model elicitation. Ensemble methods based on classification trees solve both problems at the same time: they provide a measure of importance for each variable and also a model which can predict the affection status [1][2]. As a result, they also detect interactions between the variables used for the prediction.
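The matching step and the lambda check described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the kinship matrix layout and the function names `match_controls` and `genomic_control_lambda` are hypothetical, SciPy's `linear_sum_assignment` is used as a solver for the same assignment problem that the Hungarian method [4] addresses, and lambda is computed in the usual genomic control form (median 1-df chi-square statistic divided by the median expected under the null), which may differ in detail from the computation used in the study.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.stats import chi2


def match_controls(kinship):
    """Pair each case with one distinct control, maximising total kinship.

    kinship: array of shape (n_cases, n_controls); entry [i, j] is assumed
    to hold the kinship coefficient between case i and candidate control j.
    Returns a dict {case_index: control_index}.
    """
    kinship = np.asarray(kinship, dtype=float)
    # maximize=True asks for the assignment with the largest summed kinship
    # rather than the smallest summed cost.
    case_idx, control_idx = linear_sum_assignment(kinship, maximize=True)
    return {int(i): int(j) for i, j in zip(case_idx, control_idx)}


def genomic_control_lambda(pvalues):
    """Genomic control inflation factor computed from per-marker p-values.

    Each p-value is converted to the equivalent 1-df chi-square statistic;
    lambda is the observed median divided by the null median (about 0.455).
    Values near 1 are compatible with uniformly distributed p-values.
    """
    pvalues = np.asarray(pvalues, dtype=float)
    stats = chi2.isf(pvalues, df=1)  # p = 1 maps to 0, p = 0 maps to inf
    return float(np.median(stats) / chi2.ppf(0.5, df=1))


# Toy kinship matrix (2 cases, 3 candidate controls), invented for illustration.
toy_kinship = [[0.50, 0.40, 0.00],
               [0.50, 0.00, 0.00]]
toy_pairs = match_controls(toy_kinship)
```

In the toy matrix, picking the most related control for each case one at a time would give both cases the same first choice, so a global assignment is needed to obtain distinct controls with the highest total kinship. A lambda below 1, as reported above, corresponds to p-values that are on the whole larger than expected under uniformity.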
The measure of importance of the variables should be a much better index of association because it evaluates the predictive properties of each variable instead of calculating the probability of no association. Moreover, this approach considers association structures between the markers and the disease that are more complex than those usually used, including interactions of order higher than two. These interactions are a much more realistic interpretation of the biological processes involved and of their interaction with the environment. The dependency structure of the markers is estimated, and then fixed, conditionally on the available data. This is in contrast to the usual methods, which employ multiple testing techniques that leave the dependency assumptions unspecified. Our method has higher specificity and sensitivity than the multiple testing approach. Moreover, with the proposed method we decrease the false positive signals due to linkage disequilibrium. Indeed, by considering all markers together at the same time we reduce the risk of association due simply to nearness to a "real" causative marker. It is well known that biological processes are rarely of the additive type: more often they are interactive. Nevertheless, most approaches used in the literature have considered one variable at a time and, rarely, a dozen at a time. In order to assess whether this approach is able to detect genes related to asthma we applied it to the previously selected sample. The procedure uses bootstrap aggregation (bagging), a particular case of Random Forest [1], to identify the important variables in the data set recursively, removing at each step those variables which are unimportant for the prediction of the disease status. To avoid over-fitting of the produced model we chose as stopping criterion for tree building that a node is not split further when it contains 10% of the initial sample or when its impurity is less than or equal to 10% [1].
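The recursive selection described above can be sketched roughly as follows. This is a hedged illustration, not the authors' implementation: the function name `recursive_bagging_selection`, the 0/1/2 genotype coding, the number of trees and the rule "drop variables whose importance is not above `tol`" are all assumptions. Bagging is obtained from scikit-learn's `RandomForestClassifier` by setting `max_features=None`, so that every split considers all variables. The node-size stopping rule is approximated with `min_samples_split=0.1`; the impurity-based stopping rule is not reproduced here.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier


def recursive_bagging_selection(X, y, n_trees=200, tol=0.0, random_state=0):
    """Repeatedly fit bagged trees and drop variables with importance <= tol.

    X: array (n_samples, n_variables), e.g. genotypes coded 0/1/2 plus
       non-genetic covariates (coding assumed for illustration).
    y: array (n_samples,) with the disease status.
    Returns (indices of the retained variables, the last fitted ensemble).
    """
    X = np.asarray(X)
    y = np.asarray(y)
    active = np.arange(X.shape[1])  # column indices still in the model
    while True:
        forest = RandomForestClassifier(
            n_estimators=n_trees,
            max_features=None,      # all variables at each split: plain bagging
            bootstrap=True,
            min_samples_split=0.1,  # do not split nodes below 10% of the sample
            random_state=random_state,
        )
        forest.fit(X[:, active], y)
        keep = forest.feature_importances_ > tol
        # Stop when nothing is removed, or when nothing would be left.
        if keep.all() or not keep.any():
            return active, forest
        active = active[keep]


# Synthetic example: only the first variable determines the status.
rng = np.random.default_rng(0)
X_demo = rng.integers(0, 3, size=(114, 30))
y_demo = (X_demo[:, 0] == 2).astype(int)
kept_demo, model_demo = recursive_bagging_selection(X_demo, y_demo, n_trees=25)
```

The returned ensemble is fitted only on the retained variables, so it can be used directly to predict the status of new samples restricted to those columns.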
In this way we identified 263 markers which can predict the affection status with precision. Although the produced model was constructed on a small sample and needs to be verified on a larger data set, these results are encouraging for several reasons. The number of intragenic markers is higher than expected: most SNPs in the original data set are not located inside genes, while most of the identified markers are. Moreover, one of the most important markers is located inside a gene coding for a cholinergic receptor, which could easily be associated with asthma functionally. This procedure can be used on large numbers of markers and can take into account variables of different types, either qualitative or quantitative. The identified markers are few compared with the initial data set, so their replication does not require large expense. We feel that the proposed method could make an important contribution to the study of genetic predisposition to common complex diseases and that it could resolve many of the problems linked to these types of studies.