Chapter 2 Data Preprocessing

a form of data reduction that is very useful for the automatic generation of concept hierarchies from numerical data. This is described in Section along with the automatic generation of concept hierarchies for categorical data. Figure summarizes the data preprocessing steps described here. Note that the above categorization is not mutually exclusive. For example, the removal of redundant data may be seen as a form of data cleaning as well as data reduction.

[Figure: Forms of data preprocessing (panels: data cleaning, data integration, data transformation).]

In summary, real-world data tend to be dirty, incomplete, and inconsistent. Data preprocessing techniques can improve the quality of the data, thereby helping to improve the accuracy and efficiency of the subsequent mining process. Data preprocessing is an important step in the knowledge discovery process, because quality decisions must be based on quality data. Detecting data anomalies, rectifying them early, and reducing the data to be analyzed can lead to huge payoffs for decision making.

Descriptive Data Summarization

For data preprocessing to be successful, it is essential to have an overall picture of your data. Descriptive data summarization techniques can be used to identify the typical properties of your data and highlight which data values should be treated as noise or outliers. Thus, we first introduce the basic concepts of descriptive data summarization before getting into the concrete workings of data preprocessing techniques.
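As a rough illustration of what such an overall picture can look like, the sketch below computes a few common summary statistics with Python's standard library and flags values that lie far from the bulk of the data. The function name `summarize`, the sample values, and the 1.5 x IQR cutoff for outliers are all assumptions made for this example (the cutoff is a common rule of thumb, not something stated in the text above), and the quartiles use the default method of `statistics.quantiles`, which is only one of several conventions.

```python
import statistics


def summarize(values):
    """Return basic central-tendency and dispersion measures for numeric data.

    Hypothetical helper for illustration only. Outliers are flagged with the
    1.5 x IQR rule of thumb, which is an assumption of this sketch.
    """
    data = sorted(values)
    # Quartiles via the standard library's default ("exclusive") method.
    q1, _, q3 = statistics.quantiles(data, n=4)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return {
        "mean": statistics.mean(data),
        "median": statistics.median(data),
        "mode": statistics.multimode(data),       # every most-frequent value
        "midrange": (data[0] + data[-1]) / 2,     # average of min and max
        "q1": q1,
        "q3": q3,
        "iqr": iqr,
        "variance": statistics.pvariance(data),   # population variance
        "outliers": [x for x in data if x < low or x > high],
    }


# Illustrative sample values (not taken from any real data set).
sample = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]
summary = summarize(sample)
for name, value in summary.items():
    print(f"{name}: {value}")
```

For this sample the mean is 58 and the median is 54, and the only value flagged under the assumed cutoff is 110. A different quartile convention or cutoff could change which values are flagged.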
For many data preprocessing tasks, users would like to learn about data characteristics regarding both central tendency and dispersion of the data. Measures of central tendency include mean, median, mode, and midrange, while measures of data dispersion include quartiles, interquartile range (IQR), and variance. These descriptive statistics are of great help in