Hindawi Publishing Corporation
EURASIP Journal on Bioinformatics and Systems Biology
Volume 2007, Article ID 72936, 14 pages
doi 2007 72936

Research Article
Aligning Sequences by Minimum Description Length

John S. Conery

Department of Computer and Information Science, University of Oregon, Eugene, OR 97403, USA

Received 26 February 2007; Revised 6 August 2007; Accepted 16 November 2007

Recommended by Peter Grunwald

This paper presents a new information-theoretic framework for aligning sequences in bioinformatics. A transmitter compresses a set of sequences by constructing a regular expression that describes the regions of similarity in the sequences. To retrieve the original set of sequences, a receiver generates all strings that match the expression. An alignment algorithm uses minimum description length to encode and explore alternative expressions; the expression with the shortest encoding provides the best overall alignment. When two substrings contain letters that are similar according to a substitution matrix, a code length function based on conditional probabilities defined by the matrix will encode the substrings with fewer bits. In one experiment, alignments produced with this new method were found to be comparable to alignments from CLUSTALW. A second experiment measured the accuracy of the new method on pairwise alignments of sequences from the BAliBASE alignment benchmark.

Copyright © 2007 John S. Conery. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

1. INTRODUCTION

Sequence alignment is a fundamental operation in bioinformatics, used in a wide variety of applications ranging from genome assembly, which requires exact or nearly exact matches between the ends of small fragments of DNA sequences [1], to homology search in sequence databases, which involves pairwise local alignment of DNA or protein sequences