Data Mining Lecture: Data Preprocessing - Trịnh Tấn Đạt

This chapter of the Data Mining lecture covers: why preprocess the data; descriptive data summarization; data cleaning; data integration and transformation; data reduction; discretization and concept hierarchy generation.

Trịnh Tấn Đạt
Faculty of Information Technology (Khoa CNTT), Saigon University (Đại Học Sài Gòn)
Email: trinhtandat@
Website: https site ttdat88

Outline
  Why preprocess the data?
  Descriptive data summarization
  Data cleaning
  Data integration and transformation
  Data reduction
  Discretization and concept hierarchy generation
  Summary

Why Data Preprocessing?
Data in the real world is dirty:
  Incomplete: lacking attribute values or lacking certain attributes of interest (e.g., an empty occupation field)
  Noisy: containing errors or outliers (e.g., Salary = -10)
  Inconsistent: containing discrepancies in codes or names
    e.g., Age = 42 but Birthday = 03/07/1997
    e.g., ratings were "1, 2, 3" and are now "A, B, C"
    e.g., discrepancies between duplicate records

Why Is Data Dirty?
Incomplete data may come from:
  "Not applicable" data values at collection time
  Different considerations between the time the data was collected and the time it is analyzed
  Human, hardware, or software problems
Noisy data (incorrect values) may come from:
  Faulty data collection instruments
  Human or computer error at data entry
  Errors in data transmission
Inconsistent data may come from:
  Different data sources
  Functional dependency violations (e.g., modifying some linked data)
Duplicate records also need data cleaning.

Why Is Data Preprocessing Important?
No quality data, no quality mining results.
  Quality decisions must be based on quality data; for example, duplicate or missing data may cause incorrect or even misleading statistics.
  A data warehouse needs consistent integration of quality data.
Data extraction, cleaning, and transformation make up the majority of the work of building a data warehouse.

Multi-Dimensional Measure of Data Quality
A well-accepted multidimensional view:
  Accuracy
  Completeness
  Consistency
  Timeliness
  Believability
  Value added
  Interpretability
  Accessibility

Data Types
  Numeric: the most used data type; the stored content is numeric
  Characters and strings: strings are arrays of characters
  Boolean: binary data with true and false values
  Time series: data including time- or sequential-related ...
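
The kinds of dirty data described above (incomplete, noisy, inconsistent, duplicate) can be illustrated with a few simple checks. The following is a minimal sketch, not a method from the lecture: the records, field names, the `REFERENCE_DATE`, the day/month reading of the birthday, and the helper functions `age_on`, `find_issues`, and `mean_salary` are all assumptions made for illustration.

```python
from datetime import date

# Hypothetical records; field names and values are illustrative only.
records = [
    {"id": 1, "occupation": "engineer", "salary": 5200, "age": 27, "birthday": date(1997, 7, 3)},
    {"id": 2, "occupation": "", "salary": 4100, "age": 35, "birthday": date(1989, 1, 15)},
    {"id": 3, "occupation": "teacher", "salary": -10, "age": 41, "birthday": date(1983, 5, 20)},
    {"id": 4, "occupation": "nurse", "salary": 3900, "age": 42, "birthday": date(1997, 7, 3)},
    {"id": 1, "occupation": "engineer", "salary": 5200, "age": 27, "birthday": date(1997, 7, 3)},
]

# Assumed "as of" date used to compare the stored age with the birthday.
REFERENCE_DATE = date(2024, 12, 31)


def age_on(birthday, today):
    """Return the age in whole years on the given day."""
    years = today.year - birthday.year
    if (today.month, today.day) < (birthday.month, birthday.day):
        years -= 1
    return years


def find_issues(rows, today=REFERENCE_DATE):
    """Return (row index, issue label) pairs for each problem found."""
    issues = []
    seen = set()
    for index, row in enumerate(rows):
        # Incomplete: a required attribute has no value.
        if not row["occupation"]:
            issues.append((index, "incomplete"))
        # Noisy: a value that cannot be correct, such as a negative salary.
        if row["salary"] < 0:
            issues.append((index, "noisy"))
        # Inconsistent: the stored age disagrees with the birthday.
        if abs(age_on(row["birthday"], today) - row["age"]) > 1:
            issues.append((index, "inconsistent"))
        # Duplicate: every field matches an earlier record.
        key = tuple(sorted(row.items()))
        if key in seen:
            issues.append((index, "duplicate"))
        seen.add(key)
    return issues


def mean_salary(rows, skip_indexes=()):
    """Average salary over the rows whose index is not skipped."""
    kept = [row["salary"] for i, row in enumerate(rows) if i not in skip_indexes]
    return sum(kept) / len(kept)


issues = find_issues(records)
for index, label in issues:
    print(f"row {index}: {label}")

# Compare a statistic before and after dropping the noisy and duplicate rows.
bad_rows = {i for i, label in issues if label in ("noisy", "duplicate")}
print("raw mean salary:", mean_salary(records))
print("cleaned mean salary:", mean_salary(records, bad_rows))
```

On this toy data the raw mean salary differs noticeably from the mean computed after dropping the noisy and duplicate rows, which is the kind of misleading statistic the slides warn about. Whether to drop, repair, or impute a flagged row is a separate decision that this sketch does not address.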
