A new method based on clustering improves the efficiency of imbalanced data classification

To increase the accuracy of prediction models on imbalanced data classification problems, this paper proposes a new cluster-based sampling method. Tests on a number of datasets show notable improvements over both using no data balancing strategy and a previous method.

HNUE JOURNAL OF SCIENCE
Natural Sciences, 2020, Volume 65, Issue 4A, pp. 33-41
This paper is available online at http

Nguyen Thi Hong and Dang Xuan Tho
Faculty of Information Technology, Hanoi National University of Education

Abstract. Classification of imbalanced data is an important practical problem and is attracting attention from many researchers. In medical diagnosis in particular, ill people account for only a very small percentage of the total number of people, so detecting them is difficult, and large errors can have serious consequences, even affecting human life. Classification of imbalanced data therefore requires high accuracy, and data preprocessing is a common solution that gives good results. This paper introduces some approaches to imbalanced data classification and proposes a new method based on clustering the data. We implemented this method and experimented on the UCI international data sets Blood, Glass, Haberman, Heart, Pima, and Yeast. For example, on the Yeast data, the G-mean increases over that of the original data when the new method is applied. The experimental results show that the new method significantly increases classification efficiency.

Keywords: imbalanced data classification, data mining, clustering-based undersampling.

1. Introduction

Many classification algorithms have been published, such as k-nearest neighbors, decision trees, Naïve Bayes, and support vector machines.
These are standard algorithms that apply to balanced classification cases and have been tested experimentally. However, applying them to data with a large disparity in the number of samples between classes is not effective [1-3]. Therefore, new approaches are needed for imbalanced data. Data imbalance is a case where the data have a significant difference in the number of
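
The G-mean measure reported in the abstract, and the reason plain accuracy can mislead when classes are very unequal in size, can be illustrated with a small sketch. It assumes the usual definition of G-mean as the square root of sensitivity times specificity. The function names, the toy labels, and the two hypothetical classifiers are illustrative only and are not taken from the paper.

```python
import math


def g_mean(y_true, y_pred, positive=1):
    """Geometric mean of sensitivity and specificity for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    # Recall on the positive (minority) class and on the negative (majority) class.
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return math.sqrt(sensitivity * specificity)


def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the true label."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


# Hypothetical toy data: 95 majority samples (0) followed by 5 minority samples (1).
y_true = [0] * 95 + [1] * 5

# Hypothetical classifier A: always predicts the majority class.
always_majority = [0] * 100

# Hypothetical classifier B: 10 false alarms among the negatives,
# but it finds 4 of the 5 minority samples.
balanced_guess = [1] * 10 + [0] * 85 + [1] * 4 + [0]

for name, y_pred in [("always majority", always_majority),
                     ("balanced guess", balanced_guess)]:
    print(name, "accuracy:", accuracy(y_true, y_pred),
          "G-mean:", round(g_mean(y_true, y_pred), 3))
```

On this toy data the always-majority classifier reaches 95% accuracy yet has a G-mean of 0, because it misses every minority sample. The second classifier has a lower accuracy (89%) but a G-mean of about 0.85.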
