7 Machine Learning and Data Mining

STEFAN KRAMER
Institut für Informatik, Technische Universität München, Garching, München, Germany

CHRISTOPH HELMA
Institute for Computer Science, Universität Freiburg, Georges Köhler Allee, Freiburg, Germany

© 2005 by Taylor & Francis Group, LLC

1. INTRODUCTION

In this chapter, we review basic techniques from knowledge discovery in databases (KDD), data mining (DM), and machine learning (ML) that are suited for applications in predictive toxicology. We discuss primarily methods that are capable of providing new insights and theories. Methods that work well for predictive purposes but do not return models that are easily interpretable in terms of toxicological knowledge (e.g., many connectionist and multivariate approaches) are not discussed here but are discussed elsewhere in this book. Also not included in this chapter, yet important, are visualization techniques, which are valuable for giving first clues about regularities or errors in the data.

The chapter features data analysis techniques originating from a variety of fields, such as artificial intelligence, databases, and statistics. From artificial intelligence, we know about the structure of search spaces for patterns and models, and how to search them efficiently. The database literature is a valuable source of information about efficient storage of and access to large volumes of data; it provides abstractions of data management and has contributed the concept of query languages to data mining.
Statistics is of utmost importance to data mining and machine learning, since it provides answers to many important questions arising in data analysis. For instance, it is necessary to avoid flukes, that is, patterns or models that are due to chance and do not reflect structure inherent in the data. The issue of prior knowledge has also been studied to some extent in the statistical literature. One of the most important lessons in data analysis is that one cannot be too cautious.
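As a rough illustration of what guarding against flukes can look like in practice, the following minimal sketch estimates how often an apparent difference between two groups would arise by chance alone, using a permutation test. The permutation test is one common choice and is not a method prescribed by this chapter. The function names `mean_difference` and `permutation_p_value`, the descriptor values, and the toxicity labels are all hypothetical and serve only to make the idea concrete.

```python
import random


def mean_difference(values, labels):
    """Mean of the values labelled 1 minus the mean of the values labelled 0."""
    pos = [v for v, y in zip(values, labels) if y == 1]
    neg = [v for v, y in zip(values, labels) if y == 0]
    return sum(pos) / len(pos) - sum(neg) / len(neg)


def permutation_p_value(values, labels, n_permutations=2000, seed=0):
    """Estimate how often shuffled labels yield an effect at least as large
    (in absolute value) as the one observed with the real labels."""
    rng = random.Random(seed)
    observed = abs(mean_difference(values, labels))
    shuffled = list(labels)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)  # break any real link between value and label
        if abs(mean_difference(values, shuffled)) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_permutations + 1)


# Hypothetical descriptor values and toxicity labels (illustrative only,
# not real measurements).
descriptor = [2.1, 2.4, 2.2, 2.6, 2.5, 1.1, 0.9, 1.3, 1.0, 1.2]
toxic = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

p_value = permutation_p_value(descriptor, toxic)
print(f"estimated p-value: {p_value:.4f}")
```

A small estimated p-value suggests that the observed difference is unlikely to be a product of chance in this sample. It does not show that the pattern is toxicologically meaningful, and when many candidate patterns are tested, the risk of chance findings grows accordingly.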