
The digitization of historical documents is vital for preserving cultural heritage, yet mainstream OCR (Optical Character Recognition) systems often fail to support minority languages because training data and language-specific models are scarce. This study explores how open-source OCR frameworks can be adapted to overcome these limitations, using Finnish and Swedish as case studies. We present a practical methodology for fine-tuning PaddleOCR on a combination of manually annotated and synthetically generated data, supported by high-performance computing infrastructure. Our enhanced model significantly outperforms both Tesseract and baseline PaddleOCR, particularly in recognizing handwritten and domain-specific texts. The results highlight the importance of domain adaptation, GPU acceleration, and open-source flexibility in building OCR systems tailored to under-resourced languages. This work offers a replicable blueprint for cultural institutions seeking locally deployable OCR solutions.
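
The abstract does not state which metric underlies the comparison between the fine-tuned model, Tesseract, and baseline PaddleOCR. A common choice for such comparisons is the character error rate (CER), sketched below as an assumption rather than as the study's actual evaluation code. The function names `levenshtein` and `cer` and the sample strings are illustrative only.

```python
import unicodedata


def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance divided by reference length.

    Both strings are normalized to NFC first so that letters such as
    'ä', 'ö' and 'å' count as one character regardless of encoding form.
    """
    ref = unicodedata.normalize("NFC", reference)
    hyp = unicodedata.normalize("NFC", hypothesis)
    if not ref:
        raise ValueError("reference text must not be empty")
    return levenshtein(ref, hyp) / len(ref)


if __name__ == "__main__":
    # Hypothetical ground truth and OCR output for a Swedish word.
    print(cer("smörgås", "smorgas"))
```

Normalizing to NFC matters for Finnish and Swedish in particular, since an OCR engine that emits a base letter plus a combining diacritic would otherwise be penalized for a visually identical result.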

To this day, most important documents are still issued on paper. Their security rests on the principle that the cost of creating a counterfeit must be unattractive to counterfeiters relative to the expected profit. This typically results in the use of expensive printing equipment and substrates. This work introduces an approach that evaluates paper documents using any internet-enabled device with a camera and a web browser, such as a smartphone or tablet. After the document is recognized and rectified, optical character recognition (OCR) is used to make its text machine-readable. Digital signatures are then used to verify the authenticity and integrity of the data. Beyond that, the requirements of privacy, robustness and usability are satisfied. By using JAB Code, a high-capacity matrix code, the data to be verified can be stored directly on the document without the need for a database. This brings key security and privacy advantages over database-bound systems. The use of OCR achieves high usability.

Line segmentation is a significant stage in OCR systems; it directly affects the character segmentation stage, which in turn affects the recognition rate. In this paper, a robust line segmentation algorithm is proposed for printed Arabic text, with and without diacritics, based on finding the global maximum peak and on baseline detection. The algorithm was tested on different font sizes and types; results from 5 font types, totaling 43,055 lines, show 99.9% accuracy for text without diacritics and 99.5% accuracy for text with diacritics.
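
The abstract gives no algorithmic detail, so the following is only a minimal sketch of the general idea, not the authors' algorithm. It assumes the peaks are taken from a horizontal projection profile (the count of ink pixels per row) of a binarized page, and that the baseline of each line is the row with the most ink in that line. The names `horizontal_profile` and `segment_lines` and the `min_ratio` parameter are hypothetical.

```python
from typing import List, Tuple


def horizontal_profile(image: List[List[int]]) -> List[int]:
    """Count ink pixels (value 1) in each row of a binary image."""
    return [sum(row) for row in image]


def segment_lines(image: List[List[int]],
                  min_ratio: float = 0.0) -> List[Tuple[int, int, int]]:
    """Return (top, bottom, baseline) row indices for each text line.

    Rows whose ink count is at most min_ratio times the global maximum
    peak of the profile are treated as gaps between lines. The baseline
    of a line is assumed to be its row with the highest ink count.
    """
    profile = horizontal_profile(image)
    global_peak = max(profile, default=0)
    if global_peak == 0:
        return []  # blank or empty page
    threshold = min_ratio * global_peak

    lines = []
    start = None
    # A trailing zero acts as a sentinel that closes the last band.
    for y, ink in enumerate(profile + [0]):
        if ink > threshold and start is None:
            start = y  # first row of a new text band
        elif ink <= threshold and start is not None:
            band = profile[start:y]
            baseline = start + band.index(max(band))
            lines.append((start, y - 1, baseline))
            start = None
    return lines


if __name__ == "__main__":
    # Tiny synthetic page: two text bands separated by a blank row.
    page = [
        [0, 0, 0, 0, 0, 0],
        [0, 1, 0, 1, 0, 0],
        [1, 1, 1, 1, 1, 0],
        [0, 1, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0],
        [0, 0, 1, 0, 1, 0],
        [1, 1, 1, 1, 1, 1],
        [0, 0, 0, 0, 0, 0],
    ]
    for top, bottom, baseline in segment_lines(page):
        print(top, bottom, baseline)
```

A sketch this simple would split diacritics into bands of their own whenever a blank row separates them from the main body of the text, which is presumably where a more robust method, such as the one the paper proposes, differs.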