Hindawi Publishing Corporation
EURASIP Journal on Bioinformatics and Systems Biology
Volume 2007, Article ID 87356, 9 pages
doi 2007 87356

Research Article
A Study of Residue Correlation within Protein Sequences and Its Application to Sequence Classification

Chris Hemmerich (1) and Sun Kim (2)

1 Center for Genomics and Bioinformatics, Indiana University, 1001 E. 3rd Street, Bloomington, IN 47405-3700
2 School of Informatics, Center for Genomics and Bioinformatics, Indiana University, 901 E. 10th Street, Bloomington, IN 47408-3912

Received 28 February 2007; Revised 22 June 2007; Accepted 31 July 2007

Recommended by Juho Rousu

We investigate methods of estimating residue correlation within protein sequences. We begin by using the mutual information (MI) of adjacent residues, and improve our methodology by defining the mutual information vector (MIV) to estimate long-range correlations between nonadjacent residues. We also consider correlation based on residue hydropathy rather than protein-specific interactions. Finally, in family classification experiments, the modeling power of MIV was shown to be significantly better than that of the classic MI method, reaching the level where proteins can be classified without alignment information.

Copyright 2007 C. Hemmerich and S. Kim. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

1. INTRODUCTION

A protein can be viewed as a string composed from the 20-symbol amino acid alphabet or, alternatively, as the sum of its structural properties, for example, residue-specific interactions or hydropathy (hydrophilic/hydrophobic) interactions.
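Treating a protein as a string over the amino acid alphabet makes the correlation measures named in the abstract easy to illustrate. The following Python sketch is not the authors' implementation. It assumes a simple plug-in (frequency-count) estimate of mutual information between residues a fixed distance apart, with distance 1 meaning adjacent residues, and it collects those estimates over a range of distances into a vector. The function names, the `max_distance` default, and the example sequence are hypothetical choices made for illustration only.

```python
from collections import Counter
from math import log2


def mutual_information(seq: str, distance: int = 1) -> float:
    """Plug-in MI estimate (in bits) between residues `distance` positions apart.

    Assumption: probabilities are raw relative frequencies taken from the
    sequence itself, with no pseudocounts or background distribution.
    """
    pairs = [(seq[i], seq[i + distance]) for i in range(len(seq) - distance)]
    if not pairs:
        return 0.0  # sequence too short to form any pair at this distance
    n = len(pairs)
    joint = Counter(pairs)                   # counts of (first, second) pairs
    first = Counter(a for a, _ in pairs)     # marginal counts, first position
    second = Counter(b for _, b in pairs)    # marginal counts, second position
    mi = 0.0
    for (a, b), count in joint.items():
        p_ab = count / n
        # p_ab / (p_a * p_b), rewritten in terms of counts
        mi += p_ab * log2(p_ab * n * n / (first[a] * second[b]))
    return mi


def mutual_information_vector(seq: str, max_distance: int = 20) -> list[float]:
    """One MI estimate per distance 1..max_distance (hypothetical layout)."""
    return [mutual_information(seq, d) for d in range(1, max_distance + 1)]


if __name__ == "__main__":
    example = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"  # made-up sequence
    print(mutual_information(example, distance=1))
    print(mutual_information_vector(example, max_distance=5))
```

Estimates of this kind are biased upward on short sequences, because many of the 400 possible residue pairs are never observed, so values from a single short protein should be read with caution.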
Protein sequences contain sufficient information to construct secondary and tertiary protein structures. Most methods for predicting protein structure rely on primary sequence information by matching sequences representing unknown structures to those with