Electronic Imaging

2470-1173

Society for Imaging Science and Technology

10.2352/ISSN.2470-1173.2018.2.VIPC-176

Approach for Machine-Printed Arabic Character Recognition: the-state-of-the-art deep-learning method

Daegun

Lee

Changhyung

Han

Donghyeop

Ohk

Hyeongsu

Kang

Kimin

Han

Seongwook

28 January 2018

Volume 2018, Issue 2, pp. 176-1 to 176-8

Optical character recognition (OCR) automatically recognizes text in an image and converts it into machine-readable character codes such as ASCII or Unicode. Although OCR has been studied extensively for other languages, recognizing Arabic remains a challenging problem because of character connection and segmentation issues. In this work, we propose a deep-learning framework for recognizing Arabic characters based on multi-dimensional bidirectional long short-term memory (MD-BLSTM) with connectionist temporal classification (CTC). To train this framework, we generate a dataset of over one million Arabic text-line images containing Arabic digits and basic Arabic letters in both isolated and connected forms. For comparison, we also measure the performance of existing OCR software, namely Tesseract versions 3 and 4, developed by Hewlett-Packard and Google Inc. The results show that the deep-learning method outperforms the conventional methods in terms of recognition error rate, although the Tesseract 3.0 system was faster.
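As an illustration of two ideas named in the abstract, the following minimal sketch shows how per-frame CTC output labels are commonly collapsed into a character sequence (greedy decoding), and how a recognition error rate can be computed from edit distance. It is not the authors' implementation: the blank index `BLANK`, the function names, and the choice of a character-level error rate based on Levenshtein distance are all assumptions made for this example.

```python
from itertools import groupby

BLANK = 0  # assumed index of the CTC blank label (hypothetical choice)


def ctc_greedy_decode(frame_labels, blank=BLANK):
    """Collapse consecutive repeated labels, then drop blank labels."""
    collapsed = [label for label, _ in groupby(frame_labels)]
    return [label for label in collapsed if label != blank]


def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (row-by-row DP)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def character_error_rate(ref, hyp):
    """Edit distance normalized by the reference length (assumed metric)."""
    if len(ref) == 0:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)
```

For example, the frame sequence `[0, 1, 1, 0, 1, 2, 2, 0]` decodes to `[1, 1, 2]`: the blank between the two runs of label 1 is what allows a doubled character to survive the collapse step.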

DEEP-LEARNING; LONG SHORT-TERM MEMORY; CONNECTIONIST TEMPORAL CLASSIFICATION; TESSERACT; ARABIC CHARACTER RECOGNITION; OCR PERFORMANCE