Near duplicate document detection survey

Search engines are the major breakthrough on the web for retrieving the information. But List of retrieved documents contains a high percentage of duplicated and near document result. So there is the need to improve the performance of search results. Some of current search engine use data filtering algorithm which can eliminate duplicate and near duplicate documents to save the users’ time and effort. | ISSN:2249-5789 Bassma S Alsulami et al, International Journal of Computer Science & Communication Networks,Vol 2(2), 147-151 Near Duplicate Document Detection Survey Bassma S. Alsulami, Maysoon F. Abulkhair, Fathy E. Eassa Faculty of Computing and Information Technology King AbdulAziz University Jeddah, Saudi Arabia Abstract—Search engines are the major breakthrough on the web for retrieving the information. But List of retrieved documents contains a high percentage of duplicated and near document result. So there is the need to improve the performance of search results. Some of current search engine use data filtering algorithm which can eliminate duplicate and near duplicate documents to save the users’ time and effort. The identification of similar or near-duplicate pairs in a large collection is a significant problem with wide-spread applications. In this paper survey present an up-to-date review of the existing literature in duplicate and near duplicate detection in Web. Keyword—Duplicate document, near duplicate pages, near duplicate detection, Detection approaches 1. INTRODUCTION Information on the Web is very huge in size. There is a need to use this big volume of information efficiently for effectively satisfying the information need of the user on the Web. Search engines become the major breakthrough on the web for retrieving the information. Where, among users looking for information on the Web, 85% submit information requests to various Internet search engines. Search engines are critically important to help users find relevant information on the Web. Search engines in response to a user's query typically produces the list of documents ranked according to closest to the user's request. These documents are presented to the user for examination and evaluation. Web users have to go through the long list and inspect the titles, and snippets sequentially to recognize the required results. Filtering the search engines' results consumes the users' effort and

Không thể tạo bản xem trước, hãy bấm tải xuống
TÀI LIỆU MỚI ĐĂNG
Đã phát hiện trình chặn quảng cáo AdBlock
Trang web này phụ thuộc vào doanh thu từ số lần hiển thị quảng cáo để tồn tại. Vui lòng tắt trình chặn quảng cáo của bạn hoặc tạm dừng tính năng chặn quảng cáo cho trang web này.