Hindawi Publishing Corporation
EURASIP Journal on Bioinformatics and Systems Biology
Volume 2007, Article ID 16354, 11 pages
doi:10.1155/2007/16354

Research Article
Quantification of the Impact of Feature Selection on the Variance of Cross-Validation Error Estimation

Yufei Xiao (1), Jianping Hua (2), and Edward R. Dougherty (1, 2)

(1) Department of Electrical and Computer Engineering, Texas A&M University, College Station, TX 77843, USA
(2) Computational Biology Division, Translational Genomics Research Institute, Phoenix, AZ 85004, USA

Received 7 August 2006; Revised 21 December 2006; Accepted 26 December 2006
Recommended by Paola Sebastiani

Given the relatively small number of microarrays typically used in gene-expression-based classification, all of the data must be used to train a classifier, and therefore the same training data is used for error estimation. The key issue regarding the quality of an error estimator in the context of small samples is its accuracy, and this is most directly analyzed via the deviation distribution of the estimator, this being the distribution of the difference between the estimated and true errors. Past studies indicate that, given a prior set of features, cross-validation does not perform as well in this regard as some other training-data-based error estimators. The purpose of this study is to quantify the degree to which feature selection increases the variation of the deviation distribution in addition to the variation in the absence of feature selection. To this end, we propose the coefficient of relative increase in deviation dispersion (CRIDD), which gives the relative increase in the deviation-distribution variance using feature selection as opposed to using an optimal feature set without feature selection.
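As a rough illustration of the quantity described above, the following minimal sketch computes a CRIDD-style ratio from two samples of deviations (estimated error minus true error). The abstract does not give the formula, so the normalization used here, dividing the variance increase by the variance with feature selection so that the result reads as a fraction of the total variance, is an assumption. The function name `cridd` and the sample values are hypothetical and are not taken from the paper.

```python
from statistics import pvariance


def cridd(dev_with_fs, dev_without_fs):
    """CRIDD-style ratio under an ASSUMED normalization.

    dev_with_fs:    deviations (estimated error - true error) obtained
                    when features are selected from the data.
    dev_without_fs: deviations obtained with a fixed, optimal feature set.

    Returns (Var_fs - Var_opt) / Var_fs. Dividing by Var_fs is an
    assumption made for this sketch, not a formula quoted from the paper.
    """
    var_fs = pvariance(dev_with_fs)
    var_opt = pvariance(dev_without_fs)
    if var_fs == 0:
        raise ValueError("deviation variance with feature selection is zero")
    return (var_fs - var_opt) / var_fs


# Hypothetical deviations, for illustration only (not data from the paper).
dev_with_fs = [-0.20, -0.10, 0.00, 0.10, 0.20]
dev_without_fs = [-0.10, -0.05, 0.00, 0.05, 0.10]

kappa = cridd(dev_with_fs, dev_without_fs)
print(round(kappa, 2))  # prints 0.75
```

Under this assumed normalization, a value of 0.75 would mean that three quarters of the deviation variance is attributable to feature selection. In practice the two deviation samples would come from repeated simulations with known true errors.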
The contribution of feature selection to the variance of the deviation distribution can be significant, contributing to over half of the variance in many of the cases studied. We consider linear-discriminant analysis, 3-nearest-neighbor, and linear support .