Li et al. Genome Biology 2010, 11:R50

METHOD Open Access

Modeling non-uniformity in short-read rates in RNA-Seq data

Jun Li¹, Hui Jiang¹,² and Wing Hung Wong¹,³

Abstract

After mapping, RNA-Seq data can be summarized by a sequence of read counts. These counts are commonly modeled as Poisson variables with constant rates along each transcript, a model that actually fits the data poorly. We suggest using variable rates for different positions, and propose two models to predict these rates based on local sequences. These models explain more than 50% of the variations and can lead to improved estimates of gene and isoform expressions for both Illumina and Applied Biosystems data.

Background

Microarrays are an efficient technology for measuring the expression levels of many genes simultaneously, but the method has some limitations. The expression estimates are typically not reliable for lowly expressed genes because the true signals are masked by cross-hybridization effects [1,2]. Furthermore, the design of the array depends on the annotation of gene structures, so the method is not ideal for the discovery of novel splicing events. A recently developed alternative approach, called RNA-Seq, has the potential to overcome these difficulties [3]. RNA-Seq uses ultra-high-throughput sequencing [4] to determine the sequence of a large number of cDNA fragments. The resulting sequences (reads) can be long (100 nucleotides or more) or short, depending on the platform [4]. Two currently popular short-read platforms are Illumina's Solexa [5-11] and Applied Biosystems' (ABI's) SOLiD [12]. Each can produce tens of millions of short reads in a single run [5-12]. In this paper, we consider only short-read RNA-Seq.
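The abstract's point that a constant-rate Poisson model fits poorly can be illustrated with a simple dispersion check. The sketch below is not part of the paper's method. It assumes that per-position read counts for one transcript are already available as a list of integers; the function name `dispersion_index` and the toy numbers are hypothetical. For a Poisson variable the variance equals the mean, so a variance-to-mean ratio far above 1 along a transcript suggests that a single constant rate is inadequate.

```python
from statistics import mean, pvariance


def dispersion_index(counts):
    """Return the variance-to-mean ratio of per-position read counts.

    Under a Poisson model with one constant rate along the transcript,
    this ratio is expected to be close to 1.
    """
    m = mean(counts)
    if m == 0:
        raise ValueError("no reads on this transcript")
    return pvariance(counts) / m


# Hypothetical per-position counts for one transcript (made-up numbers,
# not data from the paper).
toy_counts = [0, 1, 0, 12, 3, 0, 0, 25, 1, 2]
print(dispersion_index(toy_counts))
```

A ratio well above 1 on real data would be consistent with the non-uniformity described in the abstract, although this check says nothing about what drives the variation.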
The reads produced by RNA-Seq are first mapped to the genome and/or to the reference transcripts using computer programs. The output of RNA-Seq can then be summarized by a sequence of "counts". That is, for each position in the genome or on a putative transcript, it gives a .