Proceedings Paper
Volume: 22 | Article ID: 26
Tweaking Mainstream Open-source OCR Engine for Minority Languages, How To?
DOI: 10.2352/issn.2168-3204.2025.22.1.26 | Published Online: June 2025
Abstract

The digitization of historical documents is vital for preserving cultural heritage, yet mainstream OCR (Optical Character Recognition) systems often fail to support minority languages because training data is limited and language-specific models are lacking. This study explores how open-source OCR frameworks can be adapted to overcome these limitations, using Finnish and Swedish as case studies. We present a practical methodology for fine-tuning PaddleOCR on a combination of manually annotated and synthetically generated data, supported by high-performance computing infrastructure. Our enhanced model significantly outperforms both Tesseract and baseline PaddleOCR, particularly in recognizing handwritten and domain-specific texts. The results highlight the importance of domain adaptation, GPU acceleration, and open-source flexibility in building OCR systems tailored for under-resourced languages. This work offers a replicable blueprint for cultural institutions seeking a locally deployable OCR solution.

  Cite this article 

Tuomo , Anssi Jääskeläinen, Atte Föhr, "Tweaking Mainstream Open-source OCR Engine for Minority Languages, How To?" in Archiving Conference, 2025, pp. 140-144, https://doi.org/10.2352/issn.2168-3204.2025.22.1.26

  Copyright statement 
Copyright © 2025 Society for Imaging Science and Technology
Archiving Conference
2161-8798
Society for Imaging Science and Technology
IS&T 7003 Kilworth Lane, Springfield, VA 22151 USA