Protein families comparisons using repeatome-based profiling

Genovese, Loredana M.; Geraci, Filippo; Pellegrini, Marco

Motivation: Protein architectures form a complex multilayered hierarchy. The primary linear sequence of amino acid residues arranges itself in 3-dimensional space so as to form local structures (secondary and super-secondary structures) and extends up to fully functional folded proteins (tertiary and quaternary structures) with their functional characterization. For a majority of proteins only the primary AA sequence is known reliably, while the most valuable characterization in structural and/or functional terms is routinely attained with prediction tools that try to find matching homologous proteins within databases of validated structural/functional hierarchies (e.g. SCOP, CATH). As remarked in [Simossis and Heringa 2006], at the moment no systematic analysis has been done on how incorporating repetitive features of the primary sequence might help improve the alignment quality of homologous protein (and protein family) matching. Here we report initial findings in the direction of repeatome-based profiling of protein families, with the aim of improving current alignment/matching technologies and classification methods.

Methods: PTRStalker [Pellegrini et al. 2012] is an algorithm designed to detect Fuzzy Tandem Repeats (FTR) in protein sequences (20 AA alphabet).
Using PTRStalker as a black box, we compute an FTR profile for a protein P by (a) detecting the set FTR(P) of FTR in P, (b) computing the mean FTR length over ten random shufflings of P, and (c) removing from FTR(P) all TR of length smaller than the mean computed at (b). The statistically filtered FTR are then turned into a vector of features that includes the length of P, the lengths of all FTR that survive the statistical filtering, in order of appearance along the protein, and the features of the background FTR distribution obtained from random shuffling (mean and max values). This FTR descriptor for the protein P can be used in different ways. In the next section we report good performance of this descriptor in a direct characterization of structured and unstructured proteins. We have also used this descriptor together with the Euclidean metric to perform unsupervised learning (clustering) of SCOP protein families, obtaining highly homogeneous clusters. As a next step we plan to apply this new protein descriptor in conjunction with other descriptors (primary sequence, secondary structure, etc.) in the framework of [Chung and Yona 2004], in order to improve the prediction of distant homologies among protein families by augmenting family profiles with FTR descriptors.

Results: We have tested three validated data sets. The first data set (DS1) is a collection of 92 sequences covering 54037 bps from [Walsh et al. 2012], corresponding to 18725 bps of validated secondary structures (mostly solenoids). This benchmark is intended to measure the capability of PTRStalker in detecting existing secondary structures. After statistical filtering, PTRStalker returns 95 Fuzzy Tandem Repeats, of which 67 overlap known SS in DS1. In terms of base counts, the reported FTR cover 17544 bases, of which 11594 cover known SS in DS1 (recall: 0.62, precision: 0.66).

The second data set (DS2) is a collection of 105 proteins from the DisProt database [Sickmeier et al. 2007] classified as 100% disordered.
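The three-step filtering and the feature vector described in Methods can be sketched as follows. This is only a minimal illustration under stated assumptions: PTRStalker itself is not reproduced, and a toy detector of exact (not fuzzy) tandem repeats, here called toy_tandem_repeats, stands in for it. The function names, the parameters and the fixed random seed are all hypothetical choices made for the example.

```python
import random
from statistics import mean


def toy_tandem_repeats(seq, max_unit=10, min_copies=2):
    """Hypothetical stand-in for PTRStalker: exact tandem repeats only.

    Returns non-overlapping (start, length) pairs, scanning left to right.
    """
    repeats, i, n = [], 0, len(seq)
    while i < n:
        best = 0
        for unit in range(1, max_unit + 1):
            if i + unit * min_copies > n:
                break
            motif = seq[i:i + unit]
            copies = 1
            # Count how many consecutive copies of the motif follow position i.
            while seq[i + copies * unit:i + (copies + 1) * unit] == motif:
                copies += 1
            if copies >= min_copies:
                best = max(best, copies * unit)
        if best:
            repeats.append((i, best))
            i += best
        else:
            i += 1
    return repeats


def ftr_descriptor(seq, detect=toy_tandem_repeats, n_shuffles=10, rng=None):
    """Build [len(P), filtered repeat lengths..., background mean, background max]."""
    rng = rng or random.Random(0)  # fixed seed is an assumption, for repeatability
    # (a) repeats detected in the original sequence, in order of appearance
    observed = [length for _, length in detect(seq)]
    # (b) repeat lengths found in randomly shuffled copies of the sequence
    background = []
    for _ in range(n_shuffles):
        residues = list(seq)
        rng.shuffle(residues)
        background.extend(length for _, length in detect("".join(residues)))
    bg_mean = mean(background) if background else 0.0
    bg_max = max(background) if background else 0
    # (c) drop repeats shorter than the background mean
    kept = [length for length in observed if length >= bg_mean]
    return [len(seq), *kept, bg_mean, bg_max]


example = "ACDEFGHIKL" + "PQ" * 6 + "MNRSTVWY"
print(ftr_descriptor(example))
```

The resulting vectors have variable length, since each protein may retain a different number of repeats. How such vectors are aligned or padded before the Euclidean distance is computed is not stated in the abstract, so the sketch leaves that step out.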
The rationale of the experiment is that disordered proteins should be relatively free of long tandem repeats. We split the data into three groups of 35 proteins each, with length ranges [45-110], [111-208], and [>209], and in each class we tested the hypothesis that the disordered proteins of that length class have FTR statistically equivalent to those of randomly shuffled proteins. The Wilcoxon signed-rank test p-values on the length of the longest FTR found in the three classes are, respectively, 0.199, 0.135 and 0.008. This result suggests that such unstructured proteins are indeed free of significant FTR at least up to length 200, since only the longest class differs significantly from the shuffled background. This measure is in line with the findings of the experiment on DS1.

The third data set (DS3) is composed of 507 non-redundant proteins in 6 SCOP superfamilies from [Paccanaro et al. 2006], selected as a challenge for clustering algorithms. Within any superfamily, protein pairs have high sequence divergence but high structural similarity. For each protein we build (see Methods) a descriptor of its FTR profile, including also the background as measured by randomly shuffling the protein sequences. Clustering with the tool Amica [Geraci et al. 2008], using Euclidean distance and a target of 30 clusters, has produced 26 highly homogeneous clusters at the superfamily level (hypergeometric test p-value < 0.004, with BHY FDR adjustment for multiple testing), covering 90% of the input set. This experiment suggests that FTR characterization of proteins is a promising new feature that can be used in novel clustering and classification tasks.

Availability: http://
Contact E-Mail: marco.pellegrini@iit.cnr.it

Info:
Simossis, V.A. and Heringa, J. (2006). Local structure prediction of proteins. In: Computational Methods for Protein Structure Prediction and Modeling (Xu, Y., Xu, D., Liang, J., Eds.), Springer-Verlag, GmbH.
Chung, R. and Yona, G. (2004). Protein family comparison using statistical models and predicted structural information. BMC Bioinformatics 5:183.
Walsh, I., Sirocco, F.G., Minervini, G., Di Domenico, T., Ferrari, C. and Tosatto, S.C.E. (2012). RAPHAEL: Recognition, periodicity and insertion assignment of solenoid protein structures. Bioinformatics. doi:10.1093/bioinformatics/bts550.
Pellegrini, M., Renda, M.E. and Vecchio, A. (2012). Ab Initio Detection of Fuzzy Amino Acid Tandem Repeats in Protein Sequences. BMC Bioinformatics 13(Suppl 3):S8. doi:10.1186/1471-2105-13-S3-S8.
Sickmeier, M., Hamilton, J.A., LeGall, T., Vacic, V., Cortese, M.S., Tantos, A., Szabo, B., Tompa, P., Chen, J., Uversky, V.N., Obradovic, Z. and Dunker, A.K. (2007). DisProt: the Database of Disordered Proteins. Nucleic Acids Res. 35(Database issue):D786-93.
Paccanaro, A., Casbon, J.A. and Saqi, M.A.S. (2006). Spectral clustering of protein sequences. Nucleic Acids Research 34(5):1571-1580.
Geraci, F., Pellegrini, M. and Renda, E. (2008). AMIC@: All MIcroarray Clusterings @ once. Nucleic Acids Research 36(Web Server Issue):W315-W319.
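As a supplementary illustration of the cluster homogeneity test mentioned in Results, the sketch below computes a hypergeometric enrichment p-value for one cluster and applies a Benjamini-Hochberg-Yekutieli adjustment across clusters. It assumes that "BHY" denotes that procedure and that the test asks how likely it is to see at least k members of one superfamily in a cluster of size n drawn from the 507 proteins. The exact formulation used by the authors is not given in the abstract, and the cluster counts in the example are hypothetical.

```python
from math import comb


def hypergeom_sf(k, population, successes, draws):
    """P(X >= k) when drawing `draws` items without replacement from
    `population` items, of which `successes` belong to the class of interest."""
    upper = min(successes, draws)
    total = comb(population, draws)
    favourable = sum(
        comb(successes, i) * comb(population - successes, draws - i)
        for i in range(k, upper + 1)
    )
    return favourable / total


def by_adjust(pvalues):
    """Benjamini-Hochberg-Yekutieli adjusted p-values (step-up, capped at 1)."""
    m = len(pvalues)
    c_m = sum(1.0 / i for i in range(1, m + 1))  # harmonic correction factor
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m * c_m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical clusters: (cluster size, members from the dominant superfamily,
# size of that superfamily in the 507-protein data set).
clusters = [(17, 16, 80), (20, 19, 120), (12, 12, 60)]
raw = [hypergeom_sf(k, 507, K, n) for n, k, K in clusters]
print(by_adjust(raw))
```

A cluster would be called homogeneous at the superfamily level when its adjusted p-value falls below the chosen threshold; the abstract reports values below 0.004.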

CNR Institutional Research Information System