Optical character recognition (OCR) automatically recognizes text in an image and converts it into a machine encoding such as ASCII or Unicode. Although OCR for other languages has been studied extensively, recognizing Arabic remains a challenging problem because of character connection and segmentation issues. In this work, we propose a deep-learning framework for recognizing Arabic characters based on multi-dimensional bidirectional long short-term memory (MD-BLSTM) with connectionist temporal classification (CTC). To train this framework, we generate a dataset of over one million Arabic text-line images that contains Arabic digits, basic Arabic forms in isolated shape, and connected forms. For comparison, we also measure the performance of other OCR software, namely Tesseract, developed by Hewlett-Packard and Google Inc.; both version 3 and version 4 are used. Results show that the deep-learning method outperforms the conventional methods in terms of recognition error rate, although the Tesseract 3.0 system was faster.
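To make the two evaluation-related ideas in the abstract concrete, the following is a minimal sketch of best-path (greedy) CTC decoding and of a character-level error rate based on edit distance. It is an illustration under stated assumptions, not the authors' implementation: the blank index `BLANK = 0`, the function names, and the choice of character error rate as the metric are all assumptions, since the abstract does not specify how the recognition error rate is computed.

```python
from itertools import groupby

BLANK = 0  # assumed index of the CTC blank label (hypothetical choice)


def ctc_greedy_decode(frame_labels, blank=BLANK):
    """Best-path CTC decoding: collapse repeated labels, then drop blanks."""
    collapsed = [label for label, _ in groupby(frame_labels)]
    return [label for label in collapsed if label != blank]


def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (row-by-row dynamic programming)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def character_error_rate(ref, hyp):
    """Edit distance normalized by reference length (assumed metric)."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)
```

Here `frame_labels` stands for the per-frame argmax label indices produced by a network's output layer. A blank between two identical labels keeps them as separate characters, which is what lets CTC emit doubled letters without an explicit segmentation step.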
Daegun Ko, Changhyung Lee, Donghyeop Han, Hyeongsu Ohk, Kimin Kang, Seongwook Han, "Approach for Machine-Printed Arabic Character Recognition: the-state-of-the-art deep-learning method" in Proc. IS&T Int’l. Symp. on Electronic Imaging: Visual Information Processing and Communication IX, 2018, pp 176-1 - 176-8, https://doi.org/10.2352/ISSN.2470-1173.2018.2.VIPC-176