Approach for Machine-Printed Arabic Character Recognition: the-state-of-the-art deep-learning method

Daegun Ko; Changhyung Lee; Donghyeop Han; Hyeongsu Ohk; Kimin Kang; Seongwook Han

doi:10.2352/ISSN.2470-1173.2018.2.VIPC-176

Abstract

Optical character recognition (OCR) automatically recognizes text in an image and converts it into machine-readable codes such as ASCII or Unicode. Although OCR for other languages has been studied extensively, recognizing Arabic remains challenging because of character connection and segmentation issues. In this work, we propose a deep-learning framework for recognizing Arabic characters based on multi-dimensional bidirectional long short-term memory (MD-BLSTM) with connectionist temporal classification (CTC). To train this framework, we generate a dataset of over one million Arabic text-line images containing Arabic digits and basic Arabic letters in isolated and connected forms. For comparison, we also measure the performance of another OCR system, Tesseract (developed by Hewlett-Packard and Google Inc.), using versions 3 and 4. The results show that the deep-learning method outperforms the conventional methods in terms of recognition error rate, although Tesseract 3.0 was faster.
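The two ideas named in the abstract, CTC output decoding and recognition error rate, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not say how the network output is decoded or how the error rate is defined. The sketch assumes best-path (greedy) CTC decoding with the blank label at index 0, and an error rate equal to the Levenshtein distance divided by the reference length. All function names are hypothetical.

```python
from itertools import groupby

BLANK = 0  # assumption: index of the CTC blank label


def ctc_greedy_decode(frame_labels, blank=BLANK):
    """Best-path CTC decoding: collapse repeated labels, then drop blanks."""
    collapsed = [label for label, _ in groupby(frame_labels)]
    return [label for label in collapsed if label != blank]


def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences, using one rolling row."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def character_error_rate(ref, hyp):
    """Assumed metric: edit distance normalized by the reference length."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)
```

Here `frame_labels` stands for the per-frame argmax of the network output along the text line. A blank between two identical labels keeps both, which is how CTC can emit a doubled character without explicit segmentation.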

Electronic Imaging (ISSN 2470-1173), Society for Imaging Science and Technology

Published: 28 January 2018

Pages: 176-1 to 176-8

Keywords: deep-learning; long short-term memory; connectionist temporal classification; Tesseract; Arabic character recognition; OCR performance
