Vietnamese text extraction from book covers

TẠP CHÍ KHOA HỌC ĐẠI HỌC ĐÀ LẠT (Dalat University Journal of Science), Volume 7, Issue 2, 2017, 142–152

Phan Thi Thanh Nga (a), Nguyen Thi Huyen Trang (a), Nguyen Van Phuc (b), Thai Duy Quy (c), Vo Phuong Binh (a,*)

(a) The Faculty of Information Technology, Dalat University, Lamdong, Vietnam
(b) The Devsoft Company, Hochiminh City, Vietnam
(c) The Research Management and International Cooperation Department, Dalat University, Lamdong, Vietnam

Article history
Received: January 9th, 2017 | Received in revised form: April 19th, 2017 | Accepted: May 11th, 2017

Abstract
Automatic information extraction from images reduces cost, human intervention, and processing time. Converting printed book covers to machine-readable text for later automated processing would be useful for a wide range of users, such as librarians, bookshop keepers, and individual users. In this paper, we present a novel method for extracting Vietnamese text from images of scanned book covers. The proposed system accepts a snapshot of a book cover, filters the input image to enhance its quality, locates the regions that contain text, and then uses an optical character recognizer (OCR) to extract the text. The last step filters the extracted text against a dictionary to produce the final text result. Experiments with the proposed system on our dataset delivered encouraging results.

Keywords: Book cover; OCR (Optical Character Recognition); Text information extraction; Vietnamese text detection.

1. INTRODUCTION
Before the existence of computers, books were often categorized purely by hand using library classification systems such as DDC (Dewey Decimal Classification), LCC (Library of Congress Classification), CC (Colon Classification), and UDC (Universal Decimal Classification). However, these highly disciplined systems are preferred mainly by librarians, whereas most people tend not to follow a hierarchical structure in their everyday archiving. Thanks to the help of ...
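
As an illustration of the last step described in the abstract, filtering the OCR output against a dictionary, the following is a minimal sketch and not the authors' implementation. The paper excerpt does not say how the dictionary comparison is performed, so this sketch assumes a simple string-similarity match using Python's standard `difflib` module. The word list `DICTIONARY`, the function names `normalize` and `filter_tokens`, and the `cutoff` value are all hypothetical.

```python
import difflib
import unicodedata

# Hypothetical mini-dictionary for illustration only; a real system would
# load a full Vietnamese word list.
DICTIONARY = ["sách", "giáo", "khoa", "toán", "học", "văn", "nhà", "xuất", "bản"]


def normalize(token: str) -> str:
    """Lowercase and NFC-normalize so that composed and decomposed
    Vietnamese diacritics compare equal."""
    return unicodedata.normalize("NFC", token).lower()


def filter_tokens(tokens, dictionary=DICTIONARY, cutoff=0.6):
    """Keep tokens found in the dictionary, replace near misses with the
    closest dictionary word, and drop tokens with no close match.

    The similarity measure (difflib's ratio) and the cutoff are assumptions,
    not taken from the paper.
    """
    words = [normalize(w) for w in dictionary]
    result = []
    for token in tokens:
        candidate = normalize(token)
        if candidate in words:
            # Exact dictionary hit: keep the normalized token.
            result.append(candidate)
            continue
        # Otherwise look for the single most similar dictionary word.
        matches = difflib.get_close_matches(candidate, words, n=1, cutoff=cutoff)
        if matches:
            result.append(matches[0])
        # Tokens without a close match are treated as OCR noise and dropped.
    return result
```

With this sketch, `filter_tokens(["Sách", "giao", "kh0a", "###"])` keeps "sách", maps the misrecognized "giao" and "kh0a" to "giáo" and "khoa", and discards "###". A similarity cutoff trades recall for precision: a lower value repairs more OCR errors but also risks replacing valid words that are simply missing from the dictionary.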
